1  Introduction


The  purpose  of  this  paper  is  to  present  an  Item  Response  Theory  (IRT)  based  conceptualization 
of  test  bias  for  standardized  ability  tests.  By  “test  bias”  we  mean  a  formalization  of  the  intuitive 
idea  that  a  test  is  less  valid  for  one  group  of  examinees  than  for  another  group  in  its  attempt  to 
assess  examinee  differences  in  a  prescribed  latent  trait,  such  as  mathematics 

that  test  bias  is  the  result  of  individually-biased  items  acting  in  concert  through  a  test  scoring 
method,  such  as  number  correct,  to  produce  a  biased  test.  In  a  subsequent  paper  of  ours,  this  new 
conceptualization  of  test  bias  is  used  to  undergird  a  new  statistical  test  for  psychological  test  bias 
(Shealy  and  Stout,  1990).  Also,  a  large-scale  simulation  study  (Shealy,  1989)  has  been  conducted  of 
the  performance  properties  of  this  statistical  procedure,  in  particular  as  compared  with  the  Holland 
and Thayer (1988) modification of the Mantel-Haenszel test.

We  mention  three  distinct  features  of  the  conceptualization  of  bias  presented  herein.  First,  it 
provides  a  mechanism  for  explaining  how  several  individually-biased  items  can  combine  through  a 
test  score  to  exhibit  a  coherent  and  major  biasing  influence  at  the  test  level.  In  particular,  this 
can  be  true  even  if  each  individual  item  displays  only  a  minor  amount  of  item  bias.  For  example, 
“word  problems”  on  a  “mathematics  test”  that  are  too  dependent  on  sophisticated  written  English 
comprehension  could  combine  to  produce  pervasive  test  bias  against  English-as-a-second-language 
examinees.  A  second  feature,  possible  because  of  our  multidimensional  modeling  approach,  is  that 
the  underlying  mechanism  that  produces  bias  is  addressed.  This  mechanism  lies  in  the  distinction 
made  between  the  ability  the  test  is  intended  to  measure,  called  the  target  ability,  and  other 
abilities  influencing  test  performance  that  the  test  does  not  intend  to  measure,  called  nuisance 
determinants.  Test  bias  will  be  seen  to  occur  because  of  the  presence  of  nuisance  determinants 
possessed  in  differing  amounts  by  different  examinee  groups.  Through  the  presence  of  these  nuisance 
determinants,  bias  then  is  expressed  in  one  or  more  items.  A  third  feature,  also  possible  because  of 
our  multidimensional  modeling  approach,  is  that  a  careful  distinction  is  made  between  genuine  test 
bias  and  non-bias  differences  in  examinee  group  performance  that  are  caused  by  examinee  group 
differences  in  target  ability  distributions.  It  is  important  that  the  latter  not  be  mistakenly  labeled 
as  test  bias. 

The  novelty  of  our  approach  to  bias  lies  not  so  much  with  its  recognition  of  the  role  of  nuisance 
determinants  in  the  expression  of  test  bias,  but  rather  in  the  explicit  multidimensional  IRT  modeling 
of bias, which in turn promises a clear and thorough understanding of bias.


2  An  Informal  Description  of  Test  Bias 

We  begin  with  an  informal  definition  of  test  bias. 

Definition 2.1. Test bias occurs if the test under consideration is measuring a quantity in addition
to the one the test was designed to measure, a quantity that both groups do not possess equally. □

It  is  important  to  note  that  this  notion  of  test  bias  grows  out  of  the  traditional  non-IRT  notion  of 
test  bias  based  on  differential  predictive  validity.  Papers  by  Stanley  and  Porter  (1967),  Temp  (1971), 
and  in  particular  Cleary  (1968),  exemplify  this  classical  predictive  view  of  bias.  These  studies  used 
standardized tests to predict performance on a particular task; if the predictive link from test to
task was different for the two studied groups, then test bias was suspected. Cleary (1968), in a
seminal  paper  on  test  bias,  studied  bias  in  the  prediction  of  college  success  of  black  and  white 
students  in  integrated  colleges,  using  SAT  verbal  and  math  scores.  Her  intent  was  to  determine  if 
the  expected  first  year  GPA  (grade  point  average)  for  Whites  was  different  from  that  for  Blacks, 
after  the  two  groups  had  been  matched  on  SAT  score;  hence,  the  linear  regression  of  first  year  GPA 
on  SAT  verbal  (or  math)  score  was  separately  fit  for  both  groups  and  compared.  If  the  expected 
criterion  (GPA)  for  those  examinees  attaining  a  particular  test  score  (e.g.,  SAT  combined  score) 
were  different  across  group,  the  test  score  was  considered  a  biased  predictor  of  performance  and  test 
bias was deemed to be present. The purpose of the Cleary study was predictive; the regressions of
criterion  on  test  score  therein  were  compared  across  group  to  see  if  the  test  score  equitably  predicts 
the  performance  measured  by  the  criterion. 

Our  focus  shifts  hereafter  to  regressing  test  score  on  criterion.  The  purpose  of  the  reversed 
regression  is  to  corroborate  that  the  prediction  of  a  criterion  by  a  test  is  equitable  across  group, 
thereby  exposing  the  conceptual  underpinning  for  IRT  modeling  of  test  bias  -  in  particular  our 
modeling  of  test  bias.  The  regressions  of  test  score  on  criterion  are  compared  to  answer  the 
following  question:  are  the  average  test  scores  for  both  groups  the  same  after  the  groups  have  been 
matched  on  criterion  performance? 

This  shift  to  a  corroborative  point  of  view  brings  us  again  to  the  informal  definition  above. 
The  difference  across  group  in  the  regressions  of  test  score  on  criterion  (other  than  that  caused 
by  statistical  error)  is  due  to  an  undesirable  causative  factor  other  than  the  criterion;  that  is,  at 
least some of the test questions must be measuring something in addition to what the criterion
measures.  Furthermore,  this  difference  is  due  only  to  the  undesirable  factor,  because  the  criterion 
has  been  held  equal  in  the  two  groups. 

If  in  addition  to  the  reversal  of  the  regression  variables  just  described,  the  criterion  is  now 
chosen  as  internal  to  the  test  instead  of  external  to  it,  the  concept  of  the  internal  assessment  of 
bias results. This internal criterion becomes the “yardstick” by which the test is judged biased or
not; it is a portion of the test itself. The implicit assumption is that the “yardstick” portion of the
test consists of items known to measure only what they are supposed to measure.

An  example  adapted  from  Shepard  (1982)  clearly  illustrates  this  internally-assessed  bias:  a 
verbal  analogies  test  is  used  to  compare  reasoning  abilities  of  German  and  Italian  immigrants 
to the United States, the two populations matched on English fluency. However, 20% of the test
items  are  based  on  words  with  Latin  origins,  whereas  the  remainder  have  linguistic  structure  equally 
familiar  to  both  groups.  Here,  the  items  with  Latin  origin  words  are  possibly  biased.  A  reasonable 
internal  criterion  with  which  to  assess  this  bias  would  be  a  score  based  on  the  responses  to  the 
linguistically  neutral  items;  for,  it  is  assumed  that  these  items  are  validly  measuring  what  the  test 
is  intended  to  measure. 

The  internal  assessment  viewpoint  of  test  bias  can  be  clarified  by  noting  two  distinctions  between 
it  and  the  classical  test  theory  based  differential  regression  conceptualization  of  Cleary  and  others: 

(A)  The  “yardstick”  (criterion),  which  was  a  measurement  of  task  performance  (e.g.,  1st  year 
GPA  in  the  Cleary  study),  is  now  a  score  internal  to  the  test  (e.g.,  score  on  the  linguistically 
neutral  items  in  the  Shepard  example).  This  internal  criterion  is  most  often  an  aggregate 
measure  of  a  portion  of  the  test  item  responses  (typically  number  right). 

(B)  The  differential  regression  approach  used  regressions  of  the  external  criterion  on  test  score  in 
a  predictive  context.  In  internally-assessed  bias  studies,  the  responses  of  one  or  more  items 
suspected  of  bias  are  regressed  on  the  internal  criterion  as  a  corroborative  statistical  test  that 
these  “suspect”  items  are  measuring  the  same  thing  that  the  internal  criterion  is  measuring. 

This brings us to an essential question: what is the internal criterion measuring? It is measuring
a theoretically postulated psychometric construct that is intended to be generalizable to
a variety of possible future tasks; i.e., a latent ability of an IRT model. Thus IRT modelling of
bias becomes appropriate. An example will illustrate: the SAT math test is designed to measure a
construct,  “mathematical  ability”,  which  is  intended  to  predict  an  examinee’s  future  success  in  a 
set  of  quantitatively-oriented  college  courses  that  require  a  component  of  such  ability.  Rather  than 
assessing  the  SAT  test  against  the  corresponding  set  of  criterion  measures  of  performance  in  these 
courses,  we  wish  to  assess  the  test  against  the  construct  itself;  to  do  so  we  turn  to  the  test  itself  to 
verify  that  the  proper  measurement  of  mathematical  ability  is  being  done.  The  internal  criterion 
measures  this  ability  construct,  and  internal  test  bias  is  defined  with  respect  to  this  construct. 

The  generalizability  of  performance  measurements  on  a  variety  of  tasks  to  a  single  construct,  as 
described  above,  provides  one  motivation  to  shift  to  internally  assessed  bias  studies.  An  additional 
motivation is the practice in recent years of creating item pools, large numbers of items that are
to be used in forming multiple versions of a standardized test (see, for example, Hambleton and
Swaminathan,  1985,  Ch.  12).  A  newly  constructed  set  of  items  intended  for  inclusion  in  the  item 
pool  can  be  tested  for  bias,  relative  to  the  ability  construct  that  the  pool  is  supposedly  measuring, 
by  employing  internal  bias  detection  techniques. 

Internally  assessed  bias  studies  with  a  variety  of  test  populations  have  been  done:  Cotter  and 
Berk  (1981)  attempted  to  detect  bias  in  the  WISC-R  test  with  white  and  minority  children.  Dorans 
and Kulick (1983), in a series of studies done at Educational Testing Service, studied the possible
effect  of  differential  mastery  of  written  English  between  native  born  Americans  and  English-as-a- 
second-language  Oriental  students  on  scores  of  selected  items  on  a  mathematics  “word  problem” 
test. 

Item bias studies such as the ones above usually focus on a single item at a time; if several items
in  these  studies  are  simultaneously  found  to  be  biased,  it  is  a  result  of  statistical  bias  procedures 
conducted  for  each  item  separately,  which  raises  delicate  questions  about  simultaneous  statistical 
inference.  Moreover,  in  a  modeling  sense,  no  causative  reasons  for  the  observed  simultaneous  bias 
are  explored  by  item  bias  studies.  This  paper  studies  a  form  of  test  bias  relative  to  an  internal 
criterion;  this  kind  of  test  bias  considers  the  set  of  test  items  acting  as  a  unit  (via  a  common  causal 
mechanism)  and  combining  through  a  test  scoring  method.  The  precise  formulation  of  test  bias 
and  a  contrast  of  it  to  item  bias  is  presented  in  Section  4. 

We now consider the question of test bias relative to an internal criterion more carefully. Consider
a situation where a single verbal analogy item is embedded in two different tests, Tests M and
V say. Test V is composed of verbal analogy items, as intended, and Test M consists of mathematics
calculation items, as intended, except for the single embedded item. Assume that each item in Test
V does not contain any culture-dependent material that may favor one group. The embedded verbal
item is not biased in Test V, but the potential for bias of this item is large in Test M, because the
item measures something other than the intended-to-be-measured mathematical calculation skill.
This  illustrates  a  key  component  of  test  bias,  aptly  stated  by  Mellenbergh  (1983,  p.  294):  “An 
item  can  be  biased  in  one  set  of  items,  whereas  it  is  unbiased  in  another  set.”  Shepard  (1982)  also 
points  out  this  relativity  feature:  “. . .  if  a  test  of  spatial  reasoning  inadvertently  included  several 
vocabulary  items,  they  would  be  biased  indicators  of  the  [ability  being  measured]”  and  “. . .  in  a 
test  composed  equally  of  two  types  of  items  reflecting. . .  two  different  [ability]  constructs,  it  will 
be  a  dead  heat  to  decide  statistically  which  set  defines  the  test  [ability]  and  which  set  becomes  a 
biased  measure  of  it.” 

Implicit  in  the  above  discussion  is  the  assumption  that  a  portion  of  the  test  defines  the  internal 
criterion  by  which  the  remainder  is  measured  for  the  presence  of  bias.  A  collection  of  items  defining 
the  internal  criterion  will  be  called  a  valid  subtest.  An  informal  definition  of  a  valid  subtest  can 
now  be  given:  A  subtest  is  valid  with  respect  to  a  specified  “target”  ability  if  the  subtest  score 
is  judged  to  be  measuring  only  the  intended  target  test  ability,  i.e.,  it  stands  as  a  “proxy”  of  the 
ability  one  intends  to  measure.  More  precisely,  if  all  of  the  items  of  the  subtest  measure  only  the 
intended  ability  then  the  subtest  is  said  to  be  valid. 

There  is  a  point  about  this  definition  that  needs  mentioning.  Primarily,  the  existence  and 
identification  of  a  valid  subtest  is  an  empirical  decision  based  on  expert  opinion  or  data  at  least 
in  part  external  to  the  data  set  in  question.  Subtest  validity  cannot  be  established  based  on  the 
test  data  set  alone  nor  can  it  be  theoretically  deduced.  The  “burden  of  proof”  is  an  empirical  one 
and  lies  with  the  test  constructor.  If  all  the  items  of  a  test  depend  on  a  second  determinant  (for 
example,  if  the  responses  to  all  items  depend  on  familiarity  with  standardized  tests)  then  a  valid 
subtest  will  not  exist.  Note  that  this  is  true  even  if  the  two  groups  are  not  differentially  penalized 
by  this  dependence  of  test  items  on  familiarity  with  standardized  tests.  Thus,  the  actual  presence 
of  test  bias  is  logically  independent  of  the  existence  of  a  valid  subtest  to  be  used  for  the  assessment 
of  test  bias. 

In our framework, it must be assumed that there is a valid subtest if we are to internally determine
test bias; otherwise, it is intrinsically nondetectable internally. The responses to the valid subtest
are  used  to  tackle  the  central  problem  in  the  identification  of  test  bias:  the  need  to  distinguish 
between  group  differences  attributable  to  the  ability  construct  intended  to  be  measured  and  that 
due  to  unwanted  ability  determinants.  Because  the  valid  subtest  is  assumed  to  measure  only  the 
desired  ability,  then  no  measures  external  to  the  test  are  required  to  assess  that  ability,  although 
to  improve  accuracy  it  may  be  beneficial  to  also  use  external  data,  especially  if  the  valid  subtest  is 
short  or  if  the  assumption  of  its  validity  is  at  all  suspect.  Matching  examinees  using  a  valid  subtest 
score  controls  for  group  differences  in  the  intended-to-be-measured  ability  and  isolates  differences 
due  to  the  unwanted  determinants.  A  more  rigorous  formulation  of  “valid  subtest”  is  set  out  in 
Section  4. 
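
The matching idea in the preceding paragraph can be sketched with a small simulation. Everything in it is an illustrative assumption rather than part of the model developed in this paper: the logistic item response functions, the normal ability distributions, the group means, the 30-item valid subtest, and the names `simulate_group` and `matched_difference` are hypothetical. The sketch compares a crude matched group difference for a "control" item that depends only on target ability with that for a "suspect" item that also depends on a nuisance determinant; it is not the statistical test of Shealy and Stout (1990).

```python
import math
import random
from collections import defaultdict


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical valid subtest: 30 items that depend on the target ability only.
VALID_DIFFICULTIES = [-2.0 + 4.0 * k / 29 for k in range(30)]


def simulate_group(n, target_mean, nuisance_mean, rng):
    """Simulate n examinees; return (valid_score, control_item, suspect_item) rows."""
    rows = []
    for _ in range(n):
        theta = rng.gauss(target_mean, 1.0)   # target ability
        eta = rng.gauss(nuisance_mean, 1.0)   # nuisance determinant
        valid_score = sum(rng.random() < logistic(theta - b) for b in VALID_DIFFICULTIES)
        control = int(rng.random() < logistic(theta))              # target only
        suspect = int(rng.random() < logistic(theta + 1.5 * eta))  # target + nuisance
        rows.append((valid_score, control, suspect))
    return rows


def matched_difference(group1, group2, column):
    """Average group difference in proportion correct, matched on valid subtest score."""
    by_score1, by_score2 = defaultdict(list), defaultdict(list)
    for row in group1:
        by_score1[row[0]].append(row[column])
    for row in group2:
        by_score2[row[0]].append(row[column])
    weighted_sum, total_weight = 0.0, 0
    for score, responses1 in by_score1.items():
        if score in by_score2:  # only strata observed in both groups are comparable
            responses2 = by_score2[score]
            weight = len(responses1) + len(responses2)
            gap = sum(responses1) / len(responses1) - sum(responses2) / len(responses2)
            weighted_sum += weight * gap
            total_weight += weight
    return weighted_sum / total_weight


rng = random.Random(0)
# The groups differ in target ability (a non-bias difference) and in the nuisance determinant.
group1 = simulate_group(4000, target_mean=0.3, nuisance_mean=0.0, rng=rng)
group2 = simulate_group(4000, target_mean=0.0, nuisance_mean=-1.0, rng=rng)

control_gap = matched_difference(group1, group2, column=1)
suspect_gap = matched_difference(group1, group2, column=2)
print("matched gap, control item:", round(control_gap, 3))
print("matched gap, suspect item:", round(suspect_gap, 3))
```

Under these assumptions the matched gap for the control item should be close to zero even though the groups differ in target ability, while the gap for the suspect item should be clearly positive. Because the valid subtest score is only a noisy proxy for target ability, a small residual gap for the control item is to be expected.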

In these discussions of test bias relative to an internal criterion, multidimensionality has implicitly
been invoked; it is impossible to discuss test bias without invoking it. The informal definition
of test bias stated above employs multidimensionality: there is mention of the quantity the “test
was designed to measure” and one “in addition to” this quantity. Lord (1980, p. 220) recognized
this in his discussion of item bias: “if many of the items [in a test] are found to be seriously biased,
it appears that the items are not strictly unidimensional”.

Bias  in  one  or  more  items  has  typically  been  attributed  to  special  knowledge,  unintended  to 
be  measured,  that  is  more  accessible  to  one  of  the  test-taking  groups.  Ironson,  Homan,  Willis 
and Signer (1984) performed a bias study that involved planting within a mathematics test mathematics
word problems that required an extremely high reading level to solve them. They state
their  conclusion  that  “. . .  bias  is  sometimes  thought  of  as  a  kind  of  multidimensionality  involving 
measurement  of  a  primary  dimension  and  a  second  confounding  dimension”.  Our  viewpoint  here 
is  that  bias  is  always  the  result  of  multidimensionality. 

The  “primary  dimension”  is  referred  to  in  this  paper  as  target  ability,  because  this  is  the  ability 
the  test  intends  to  measure.  The  “confounding  dimension”  is  referred  to  as  a  nuisance  determinant. 
In  the  Shepard  verbal  analogies  example  above,  the  target  ability  is  reasoning  ability,  which  80%  of 
the  items  solely  measure,  while  the  nuisance  determinant  is  familiarity  with  Latin  linguistic  roots. 

The full formulation of test bias is set out in Section 4; it involves certain subtleties not discussed
here. The group differences in ability level of a latent nuisance determinant provide a
common causative mechanism for bias in any collection of items on a test contaminated with such
a determinant. This is the single most important conceptual difference between the test bias model
developed  in  this  paper  and  previous  item  bias  work:  the  existence  of  a  postulated  common  latent 
cause  for  the  manifestation  of  bias  across  a  group  of  test  items. 

3  The  IRT  Model  for  Test  Responses 

Herein we present the nonparametric multidimensional IRT model underlying our modeling of test
bias. Consider a group $G$ of examinees; the sample of examinees to take a test is considered to
be drawn at random from this population. A test is simply a collection of items; a test response of
length $N$ is the corresponding set of responses, for a randomly-chosen examinee from $G$, denoted
by

$$\underline{U} = (U_1, \ldots, U_N) \tag{3-1}$$

where the $U_i$ are random variables taking on the values

$$U_i = \begin{cases} 0 & \text{if response to item } i \text{ is incorrect;} \\ 1 & \text{if response is correct.} \end{cases}$$

The IRT model is composed of two components that generate $\underline{U}$: (1) a $d$-dimensional examinee
ability parameter and (2) a set of item response functions (IRFs), one for each item, which
determine the probability of correct response for the items. The IRT model is usually conceived as
a unidimensional ($d = 1$) model; here, a multidimensional ($d > 1$) model will be presumed.

Let us now further set notation. The ability vector is

$$\underline{\theta} = (\theta_1, \ldots, \theta_d) \tag{3-2}$$

for an arbitrary examinee from $G$. A distribution of $\underline{\theta}$ over $G$ is induced by choosing examinees at
random from $G$; the multivariate random variable is designated

$$\underline{\Theta} = (\Theta_1, \ldots, \Theta_d). \tag{3-3}$$

Examinee independence is assumed; i.e., $J$ examinees from $G$ have ability parameters
$\underline{\Theta}_j$ ($j = 1, \ldots, J$)
independent and identically distributed (iid) in $j$. Item $i$'s IRF, which is interpreted as the probability
that an examinee with ability $\underline{\theta}$ will answer item $i$ correctly, is denoted:

$$P_i(\underline{\theta}) = P[U_i = 1 \mid \underline{\Theta} = \underline{\theta}] \equiv P[U_i = 1 \mid \underline{\theta}].$$




Our interpretation of $P_i(\underline{\theta})$ is the sampling one: among all examinees having ability $\underline{\theta}$, the expected
proportion of them getting item $i$ correct is $P_i(\underline{\theta})$.

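The basic philosophy of the IRT model is that a latent distribution of abilities in a Group $G$
drives the manifest distribution of item responses. The fundamental identity relating the responses
$\underline{U}$ to the examinee group ability variable $\underline{\Theta}$ is

$$P[\underline{U} = \underline{u}] = \int_{\underline{\theta}} P[\underline{U} = \underline{u} \mid \underline{\Theta} = \underline{\theta}] \, dF(\underline{\theta}), \tag{3-4}$$

for all $\underline{u} = (u_1, \ldots, u_N)$ (each $u_i = 0$ or $1$),

where $F(\cdot)$ is the cumulative distribution function (cdf) of $\underline{\Theta}$. There are two fundamental assumptions
on the conditional test response probability $P[\underline{U} = \underline{u} \mid \underline{\theta}] \equiv P[\underline{U} = \underline{u} \mid \underline{\Theta} = \underline{\theta}]$ usually
assumed in IRT modeling. To introduce these, recall two standard definitions about ordering in
$d$-dimensional Euclidean space: (i) Let $\underline{z}$ and $\underline{z}'$ be vectors. Then $\underline{z} < \underline{z}'$ if $z_i \le z_i'$ for $i = 1, \ldots, d$
and for at least one $i$, $z_i < z_i'$. (ii) Let $\underline{z}$ and $\underline{z}'$ be vectors. The real valued function $f(\underline{z})$ is strictly
monotone if for any $\underline{z} < \underline{z}'$, $f(\underline{z}) < f(\underline{z}')$.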

The  fundamental  IRT  assumptions  are: 

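Assumption 3.1. Local independence in $\underline{\theta}$: for every $\underline{\theta}$,

$$P[\underline{U} = \underline{u} \mid \underline{\theta}] = \prod_{i=1}^{N} P[U_i = u_i \mid \underline{\theta}] \quad \text{for all } u_i = 0 \text{ or } 1;\ i = 1, \ldots, N. \tag{3-5}$$

Assumption 3.2. Strict monotonicity of IRFs: The item IRFs $\{P_i(\underline{\theta}) : i = 1, \ldots, N\}$ are strictly
monotone in $\underline{\theta}$. That is, for any $i$, $P_i(\underline{\theta}') > P_i(\underline{\theta})$ if $\underline{\theta}' > \underline{\theta}$ in the sense of (i) above.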

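It is convenient to combine (3-4) and Assumption 3.1 in the following manner:

$$P[\underline{U} = \underline{u}] = \int_{\underline{\theta}} \prod_{i=1}^{N} P_i(\underline{\theta})^{u_i} \, [1 - P_i(\underline{\theta})]^{1 - u_i} \, dF(\underline{\theta}) \tag{3-6}$$

for all $\underline{u}$.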
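
As a rough numerical illustration of (3-6), the integral over the ability distribution can be approximated by averaging the local-independence product over random draws from $F$. The sketch below is only that: the two-dimensional compensatory logistic IRFs, the independent standard normal ability components, and the names `make_irf` and `pattern_probability` are hypothetical choices made for the example, since the model in this section is nonparametric.

```python
import itertools
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_irf(a1, a2, b):
    """Hypothetical compensatory logistic IRF on a 2-dimensional ability vector."""
    return lambda theta: logistic(a1 * theta[0] + a2 * theta[1] - b)


def pattern_probability(u, irfs, ability_draws):
    """Monte Carlo version of (3-6): average the local-independence product over draws from F."""
    total = 0.0
    for theta in ability_draws:
        product = 1.0
        for u_i, irf in zip(u, irfs):
            p = irf(theta)
            product *= p if u_i == 1 else 1.0 - p
        total += product
    return total / len(ability_draws)


rng = random.Random(1)
# Assumed ability distribution F: independent standard normal components.
draws = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(5000)]
# Positive loadings on both components keep each IRF strictly monotone (Assumption 3.2).
irfs = [make_irf(1.0, 0.2, -0.5), make_irf(1.0, 0.5, 0.0), make_irf(0.8, 1.0, 0.5)]

probabilities = {u: pattern_probability(u, irfs, draws)
                 for u in itertools.product((0, 1), repeat=len(irfs))}
for u, prob in sorted(probabilities.items()):
    print(u, round(prob, 4))
print("sum over all patterns:", round(sum(probabilities.values()), 6))
```

Because the same ability draws are reused for every response pattern, the estimated probabilities sum to one up to floating-point error, which is a convenient sanity check on the product form.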

The notion of the dimensionality $d$ of $\underline{U}$ can be mathematically formalized but for the purposes
of this paper it is unnecessary to do so.

Definition 3.1. Let $\underline{U}$ be a test response as in (3-1). An IRT representation of $\underline{U}$ is the structure

$$\{d, \underline{\Theta}, F(\underline{\theta}), \{P_i(\underline{\theta}) : i = 1, \ldots, N\}\} \tag{3-7}$$

where (3-4), Assumption 3.1, and Assumption 3.2 hold. □

In this paper we often want to consider a test item's operating characteristic with respect to a
specific single component of $\underline{\theta}$ (say $\theta_1$). This is accomplished by “marginalizing out” the remaining
components in the $\underline{\theta}$-vector from the item’s IRF, resulting in the marginal item response function
(marginal IRF). Conceptually, this IRF is a unidimensional reduction of the original one and can
be considered as a unidimensional IRF for the unidimensional ability $\theta_1$. The following definition
is due to Stout (1989).

Definition 3.2. Let $P(\underline{\theta})$ be an IRF. The marginal IRF $T(\theta_1)$ of $P(\underline{\theta})$ with respect to $\theta_1$ is defined
by

$$T(\theta_1) = E[P(\underline{\Theta}) \mid \Theta_1 = \theta_1].$$
□

The marginal IRF is essential in the discussion of modeling test bias in Section 4, where a single
component $\theta_1$ of $\underline{\theta}$ designated as the target ability will be considered. Because target ability is the
ability the test designer desires to measure using the items, the marginal IRF with respect to this
ability is a useful concept.
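
A marginal IRF can be approximated numerically by integrating the full IRF against the conditional distribution of the remaining ability components. The following sketch does this for $d = 2$ under assumptions that are ours and purely illustrative: a normal-ogive IRF with hypothetical parameters `a1`, `a2`, `b`, and a conditional distribution of the second component given $\Theta_1 = \theta_1$ that is normal with mean `rho * theta1` and variance `1 - rho**2` (as for a standard bivariate normal with correlation `rho`).

```python
import math


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def irf(theta1, theta2, a1=1.0, a2=0.8, b=0.0):
    """Hypothetical two-dimensional normal-ogive IRF P(theta1, theta2)."""
    return normal_cdf(a1 * theta1 + a2 * theta2 - b)


def marginal_irf(theta1, rho, n_steps=2000, width=8.0):
    """Approximate T(theta1) = E[P(Theta) | Theta_1 = theta1] by midpoint-rule integration."""
    mean = rho * theta1                  # assumed conditional mean of the second component
    sd = math.sqrt(1.0 - rho * rho)      # assumed conditional standard deviation
    lower = mean - width * sd
    step = 2.0 * width * sd / n_steps
    total = 0.0
    for k in range(n_steps):
        theta2 = lower + (k + 0.5) * step
        total += irf(theta1, theta2) * normal_pdf(theta2, mean, sd) * step
    return total


grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
marginal = [marginal_irf(theta1, rho=0.5) for theta1 in grid]
for theta1, value in zip(grid, marginal):
    print(theta1, round(value, 4))
```

With these particular choices the computed values increase along the grid, so the marginal IRF behaves like a unidimensional IRF for $\theta_1$; whether that holds in general depends on the conditions discussed next.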

In order for $T(\theta_1)$ to be an IRF it must be strictly monotone; this does not follow for the
marginal IRFs of a test from the assumptions of our IRT representation (3-7). However, very mild
regularity conditions suffice to produce strict monotonicity, as has been shown by Stout (1989). To
this end, we need the concept of stochastic ordering.

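Definition 3.3. Let $\underline{Z}$ be a random vector with distribution indexed by a parameter $\gamma$. $\underline{Z}$ is
strictly stochastically increasing in $\gamma$ if for every $\underline{z}$ in the range of $\underline{Z}$

$$P[\underline{Z} > \underline{z}; \gamma] < P[\underline{Z} > \underline{z}; \gamma'] \quad \text{if } \gamma < \gamma'.$$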

Strict monotonicity of the marginal IRF with respect to $\theta_1$ follows under the reasonable assumption
of stochastic order in $\Theta_1$:

Theorem 3.1. (See Stout, 1989). If $\underline{\Theta} \mid \Theta_1 = \theta_1$ is strictly stochastically increasing in $\theta_1$ in the
sense of Definition 3.3 and the IRF $P(\underline{\theta})$ is strictly monotone in $(\theta_2, \ldots, \theta_d)$ then the marginal IRF
of $P(\underline{\theta})$ with respect to $\theta_1$ is strictly monotonic.

Remark. Note in Theorem 3.1 that $P(\underline{\theta})$ is not assumed to be strictly monotone in $\theta_1$, the first
component of $\underline{\theta} = (\theta_1, \theta_2, \ldots, \theta_d)$.
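
A worked special case may help fix ideas. It rests on assumptions of our own choosing, not ones made in this paper: take $d = 2$, a normal-ogive IRF $P(\underline{\theta}) = \Phi(a_1\theta_1 + a_2\theta_2 - b)$ with $a_2 > 0$ (where $\Phi$ is the standard normal cdf and $a_1$, $a_2$, $b$ are hypothetical item parameters), and suppose that, given $\Theta_1 = \theta_1$, the second component $\Theta_2$ is normal with mean $\rho\theta_1$ and variance $1 - \rho^2$. Using the standard identity $E[\Phi(c + aX)] = \Phi\big((c + am)/\sqrt{1 + a^2 s^2}\big)$ for $X \sim N(m, s^2)$, the marginal IRF is available in closed form (the `align*` environment assumes the `amsmath` package):

```latex
\begin{align*}
T(\theta_1)
  &= E\big[\Phi(a_1\theta_1 + a_2\Theta_2 - b) \,\big|\, \Theta_1 = \theta_1\big] \\
  &= \Phi\!\left(\frac{(a_1 + a_2\rho)\,\theta_1 - b}{\sqrt{1 + a_2^{2}\,(1 - \rho^{2})}}\right).
\end{align*}
```

In this special case $T(\theta_1)$ is strictly increasing exactly when $a_1 + a_2\rho > 0$. In particular, with $a_1 = 0$ the IRF does not depend on $\theta_1$ at all, yet the marginal IRF is still strictly increasing whenever $\rho > 0$, that is, whenever the conditional distribution of $\Theta_2$ shifts upward as $\theta_1$ increases. This is consistent with the Remark: the monotonicity of the marginal IRF is supplied by the stochastic ordering, not by monotonicity of $P(\underline{\theta})$ in $\theta_1$.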

A note on IRT model assumptions should be emphasized here. IRT models are commonly parameterized;
that is, the IRFs and ability distribution are members of parametric families. Typical
assumptions are that $\Theta$ is unidimensional with a standard normal distribution and that a two or
three parameter normal ogive model or a one, two, or three parameter logistic model is assumed
for the IRFs. In this paper, we assume only that the IRFs $\{P_i(\underline{\theta})\}$ are continuous, with $\underline{\Theta}$ usually
multidimensional.

4  Test  Bias  in  the  IRT  Model 

In  this  section  our  multidimensional  IRT  based  notion  of  test  bias  using  the  IRT  model  of  Section  3 
is  developed.  Section  4.1  provides  a  brief  presentation  on  IRT  item  bias  as  currently  usually  defined 
in  the  psychometric  literature.  Section  4.2  sets  up  the  multidimensional  IRT  framework  for  test 
bias  modeling;  target  ability  and  nuisance  determinants  are  defined.  Section  4.3  develops  test  bias 
in  terms  of  its  components:  potential  for  bias,  expressed  bias,  and  the  combining  of  expressed 
item biases through a test scoring method. Section 4.4 considers item bias cancellation when the
nuisance  determinants  are  multidimensional.  Finally,  Section  4.5  formally  considers  the  notion  of 
a  valid  subtest. 

4.1  Existing  IRT  Item  Bias  Definition 

In  this  section  the  concept  of  IRT-modeled  item  bias  (in  some  contexts  called  DIF,  for  differential 
item  functioning)  currently  in  widespread  use  is  presented  as  a  backdrop  for  the  development  of 
multiple-item  test  bias,  which  is  treated  in  Sections  4.2  and  4.3.  An  item  is  biased,  according 
to Hambleton and Swaminathan (1985, p. 285) if its (necessarily unidimensional) item response
functions  across  groups  are  not  identical.  A  formal  definition  is  given  below. 

Definition 4.1. Item bias. Let two groups of examinees be indexed by $g = 1, 2$. For each $g$, denote

$$\underline{U}_g = (U_{1g}, \ldots, U_{Ng}) \tag{4-1}$$

to be the test response from an $N$-item test for a randomly chosen examinee from Group $g$. Assume
that a unidimensional IRT model fits each group, with IRT representation for $\{\underline{U}_g; g = 1, 2\}$ (recall
Definition 3.1):

$$\{d = 1, \Theta_g, F_g(\theta), \{P_j(\theta) : j = 1, \ldots, i-1, i+1, \ldots, N;\ P_{ig}(\theta)\},\ g = 1, 2\}, \tag{4-2}$$

where $F_g(\theta)$ denotes the cdf of $\Theta_g$. (Note, as the subscript notation indicates, that all items except
the $i$th item have group invariant IRFs while item $i$ has an IRF that possibly differs for the two
groups.)

(i) Item bias occurs in item $i$ at $\theta$ if the group specific probabilities of correct response at $\theta$ are
different; i.e., the group IRFs are different at $\theta$:

$$P_{i1}(\theta) \equiv P[U_{i1} = 1 \mid \Theta_1 = \theta] \ne P[U_{i2} = 1 \mid \Theta_2 = \theta] \equiv P_{i2}(\theta).$$

(ii) Item bias occurs in item $i$ if there exists some value $\theta$ for which item bias occurs at $\theta$. □

It  is  important  to  observe  that  the  “bias”  of  item  i  is  defined  relative  to  the  other  N  -  1  items, 
which  are  assumed  invariant  and  hence  “unbiased”  with  respect  to  the  two  groups. 

Item bias models have traditionally been parametric. Wright, Mead and Draba (1976) and Holland
and Thayer (1988) consider a biased item generated by Rasch IRFs with the IRF difficulties
($b$’s) different for the 2 groups. The more general 2PL and 3PL models, with different discriminations
($a$’s) and guessing parameters ($c$’s) across group, have been studied by Hulin, Drasgow
and Komocar (1982), Linn, Levine, Hastings, and Wardrop (1981), and Thissen, Steinberg, and
Wainer (1988), among many others.
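
The Rasch version of item bias described above can be sketched in a few lines. The function names (`rasch_irf`, `irf_gap`, `shows_item_bias`), the difficulty values, and the ability grid are hypothetical and serve only to illustrate Definition 4.1 for an item whose difficulty differs between the two groups; checking a finite grid is of course only a stand-in for "there exists some value of the ability".

```python
import math


def rasch_irf(theta, b):
    """Rasch IRF: probability of a correct response at ability theta for difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def irf_gap(theta, b_group1, b_group2):
    """Group 1 IRF minus Group 2 IRF at ability theta, for group-specific difficulties."""
    return rasch_irf(theta, b_group1) - rasch_irf(theta, b_group2)


def shows_item_bias(b_group1, b_group2, grid, tol=1e-12):
    """Definition 4.1(ii) checked on a finite grid: the group IRFs differ at some grid point."""
    return any(abs(irf_gap(theta, b_group1, b_group2)) > tol for theta in grid)


grid = [-3.0 + 0.5 * k for k in range(13)]   # ability values from -3 to 3
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(irf_gap(theta, 0.0, 0.5), 4))
print(shows_item_bias(0.0, 0.5, grid), shows_item_bias(0.3, 0.3, grid))
```

For Rasch IRFs a difference in difficulty shifts the whole curve, so the gap is nonzero at every ability value; with 2PL or 3PL items the group IRFs may cross, and the sign of the gap can then depend on the ability level.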

Item bias addresses differential performance across group for a single item at a time. If several
items display bias relative to the remaining assumed group invariant items according to Definition 4.1
(modified to allow several IRFs to possibly differ across group), there are no components in
Definition 4.1 that provide the facility to explain simultaneous item biasing due to a single underlying
reason. This provides the motivation for an IRT framework that explains such pervasiveness
of item bias.

4.2  The  IRT  Framework  for  Multidimensional  Test  Bias 

In  our  treatment,  test  bias  is  modeled  using  the  nonparametric  multidimensional  IRT  framework 
described  in  Section  3.  The  multidimensionality  of  the  underlying  latent  abilities  for  the  two  groups 
provides  the  environment  from  which  bias  expresses  itself  in  one  or  more  items.  A  crucial  component 
in  this  test  bias  model  is  the  modeling  of  a  pervasive  nuisance  determinant,  which  contaminates  a 
significant  portion  of  the  test  items.  This  modeling  viewpoint  is  an  attempt  to  retain  the  view  that 
bias originates at the test question level yet to allow for the possibility of bias expressed through a
test  score  as  in  the  classical  differential  regression  approach  discussed  above  in  Section  2. 

The  setup  of  the  multidimensional  IRT  model  for  a  test  administration  to  two  groups  is  as 
follows.  The  IRT  representation  (3-7)  is  assumed  to  hold  for  the  combined  two-group  population 
of  examinees.  This  representation  induces  a  separate  IRT  representation  of  the  form  of  (3-7)  for 
each  of  the  two  groups: 

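$$\{d, \underline{\Theta}_g, F_g(\underline{\theta}), \{P_{ig}(\underline{\theta}) : i = 1, \ldots, N\}\}, \quad g = 1, 2, \tag{4-3}$$

where $\underline{\Theta}_g$ here denotes $\underline{\Theta}$ restricted to Group $g$, $F_g(\underline{\theta})$ denotes the cdf of $\underline{\Theta}_g$, and $P_{ig}(\underline{\theta})$ denotes the
$i$th IRF for a randomly selected examinee from the subpopulation of Group $g$ examinees of ability
$\underline{\theta}$. Note that the distribution of $\underline{\Theta}_1$ will in general be different from that of $\underline{\Theta}_2$. It is convenient to
denote the combined two group IRT representation by

$$\{d, \underline{\Theta}_g, F_g(\underline{\theta}), \{P_{ig}(\underline{\theta}) : i = 1, \ldots, N\} : g = 1, 2\}. \tag{4-4}$$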

The  IRT  representation  (4-4)  will  be  assumed  throughout  the  remainder  of  Section  4  (with  (3-4), 
Assumption  3.1,  and  Assumption  3.2  assumed  to  hold  within  each  group  of  course).  Implicit  in 
(4-4)  is  the  assumption  that  the  test  measures  the  same  psychometrically-defined  ability  construct 
9  in  both  groups. 

Two basic assumptions additional to Assumptions 3.1 and 3.2 about the IRT representation (4-4)
are necessary: (1) common multidimensional IRFs for each of the two groups in the representation
(4-4) (i.e., IRF invariance across group) and (2) the capability of the test to measure (possibly with
contamination) the intended-to-be-measured ability (target ability):

Assumption 4.1. In the assumed IRT representation (4-4) assume IRF group invariance, that is

    $P_{i1}(\underline{\theta}) = P_{i2}(\underline{\theta}) = P_i(\underline{\theta})$    (4-5)

for all $\underline{\theta}$. □

This first additional assumption states that the usual IRT item parameter invariance assumed
in unidimensional IRT modeling is assumed to hold for our multidimensional IRT model, where $\underline{\theta}$
includes all the abilities influencing test performance (hence the assumption of IRF group invariance
is appropriate in this context). Such invariance does not necessarily hold for any subset of the
components of $\underline{\theta}$, in particular not for the target ability alone. Indeed, if invariance with respect to
target ability held for all items, it is intuitively clear there could be no bias. For example, in the
usual definition of item bias (Definition 4.1) invariance is not assumed for the biased item. (4-5) is
assumed throughout the rest of the paper.

We  now  define  target  ability. 

Definition 4.2. Target ability is the unidimensional latent ability the test intends to measure.
The target ability component is denoted by $\theta$, and the target ability random variable for Group $g$
is denoted $\Theta_g$.

Remark. $\Theta_g$ is not to be confused with $\underline{\Theta}_g$ as defined in (4-4).

If a discussion of test bias is appropriate in a test administration, then it must be the case that
the test is designed so that it is in fact measuring $\theta$, as well as possibly some nuisance components
inadvertently. We thus informally make the second additional assumption that all items of the
test in fact do measure target ability $\theta$ and possibly nuisance components $\underline{\eta}$ as well. That is, all
IRFs $P_i(\theta, \underline{\eta})$ are assumed strictly increasing in $\theta$ throughout the paper. In Shealy (1989), this
assumption is formalized and it is then proved that the existence of a representation (4-4) in turn
implies the existence of an analogous representation in terms of $(\theta, \underline{\eta})$; that is, in terms of target
ability and nuisance components. Here we bypass presentation of this formalism and instead assume
an IRT representation of the form (4-4) with

    $\underline{\Theta}_g = (\Theta_g, \underline{\eta}_g)$    (4-6)

where $\Theta_g$ denotes target ability and $\underline{\eta}_g$ denotes nuisance ability for a randomly chosen Group $g$
examinee. That is, the two group IRT representation

    $\{(\Theta_g, \underline{\eta}_g), F_g(\theta, \underline{\eta}), \{P_i(\theta, \underline{\eta}), i = 1, \ldots, N\} : g = 1, 2\}$    (4-7)

where the $P_i$'s are the group invariant IRFs guaranteed to exist by Assumption 4.1, is assumed
throughout the remainder of the paper.

4.3  A  Multidimensional  Formulation  of  Test  Bias 

Item bias postulates that examinees scaled on a univariate latent $\theta$ (as in Definition 4.1) display
differing item response probability across group for the biased item. We will take the postulated
ability $\theta$ to be the target ability to create an IRT-based definition of test bias.

As in item bias studies, test bias of this sort is an entity studied at the “micro level” of each
fixed value of $\theta$; so one may speak of “test bias at $\theta$”. Test bias at the “macro level” may be defined
to exist if it exists at one or more single $\theta$-values; important aspects of this micro/macro duality are
considered in Section 6. The following formulation of test bias is composed of three components:

(a) The potential for bias, if it exists, resides within the multidimensional target/nuisance ability
distributions in two groups;

(b) potential for bias is expressed in items whose responses depend on one or more nuisance
determinants; and

(c) the scoring method of the test, to be viewed as an estimate of target ability, transmits expressed
item biases into test bias.

4.3.1  Potential  for  test  bias 

Before the concept of “potential for test bias” can be developed, it is necessary to introduce
conditions postulating stochastic ordering of ability distributions.

Consider a nuisance ability $\eta_g$, assumed unidimensional for simplicity of explication, for two
groups $g$, $g = 1, 2$. Either the distribution is the same for both groups or, by definition, there exists
some $\eta$ for which

    $P[\eta_1 > \eta] \neq P[\eta_2 > \eta].$

Say that, as psychometricians, we believe that Group 1 has “more” of this ability. Likely the most
natural way to mathematize this belief is to assume stochastic ordering, that is to assume

    $P[\eta_1 > \eta] > P[\eta_2 > \eta]$

for all $\eta$. For $\eta_1$ and $\eta_2$ that possess densities, the graphical intuition is given in Figure 1. For
example, as Figure 1 suggests, the densities might be identical except for translation. Of course,
if two groups differ in ability distribution, it does not follow logically that one or the other group
has “more” ability. For example, a situation where the variances of $\eta_1$ and $\eta_2$ are not equal could
produce

    $P[\eta_1 > \eta] < P[\eta_2 > \eta]$ for $\eta > 0$
[Figure 2]


and 


    $P[\eta_1 < \eta] < P[\eta_2 < \eta]$ for $\eta < 0$.

In particular, $\eta_1$ and $\eta_2$ might be symmetrically distributed about 0 with $\eta_2$ having the larger
variance, as displayed in Figure 2. Nonetheless, for many psychometric applications, it seems
plausible to assume stochastic ordering whenever ability distributions are not equal, as we will do
below.
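
The contrast between a translated distribution and a distribution with larger variance can be checked numerically. The following is a minimal sketch only: it assumes normal nuisance distributions, which the text does not require, and the helper `survival` and the specific means and standard deviations are hypothetical choices made for illustration.

```python
import math

def survival(x, mean, sd):
    """P[eta > x] for a normal(mean, sd) nuisance ability (illustrative assumption)."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

# Evaluation grid for eta from -3.0 to 3.0 in steps of 0.1.
grid = [k / 10.0 for k in range(-30, 31)]

# Case 1: identical shape, Group 1 translated upward, so Group 1
# stochastically dominates Group 2 at every grid point.
shift_ordered = all(survival(x, 0.5, 1.0) > survival(x, 0.0, 1.0) for x in grid)

# Case 2: equal means, Group 2 has the larger variance. The survival curves
# cross at 0, so neither group is stochastically larger.
upper = all(survival(x, 0.0, 1.0) < survival(x, 0.0, 2.0) for x in grid if x > 0)
lower = all(survival(x, 0.0, 1.0) > survival(x, 0.0, 2.0) for x in grid if x < 0)
variance_crosses = upper and lower

print(shift_ordered, variance_crosses)
```

In the second case the upper-tail comparison favors Group 2 above 0 and Group 1 below 0, which is the situation the stochastic ordering assumption rules out.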

The potential for test bias is modeled via one or more determinants that simultaneously cause
bias in a collection of items. In particular, this cause is rooted in the conditional distributions of
$\underline{\eta}_g \mid \Theta_g = \theta$ (note that $\underline{\eta}_g$ can be multidimensional here). For a fixed $\theta$, we assume stochastic
ordering for the distributions of $\underline{\eta}_g \mid \Theta_g = \theta$ ($g = 1, 2$) when they are not equal:

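Assumption 4.2. Let $(\Theta_g, \underline{\eta}_g)$ be as in (4-7) and fix a target ability value $\theta$. If the conditional
distributions $\underline{\eta}_1 \mid \Theta_1 = \theta$ and $\underline{\eta}_2 \mid \Theta_2 = \theta$ are different, then the assumption is that either

    $(\underline{\eta}_1 \mid \Theta_1 = \theta) < (\underline{\eta}_2 \mid \Theta_2 = \theta)$ or $(\underline{\eta}_1 \mid \Theta_1 = \theta) > (\underline{\eta}_2 \mid \Theta_2 = \theta)$    (4-8)

stochastically; i.e., either

    $P[\underline{\eta}_1 > \underline{\eta} \mid \Theta_1 = \theta] < P[\underline{\eta}_2 > \underline{\eta} \mid \Theta_2 = \theta]$    (4-9)

for all $\underline{\eta}$ or

    $P[\underline{\eta}_1 > \underline{\eta} \mid \Theta_1 = \theta] > P[\underline{\eta}_2 > \underline{\eta} \mid \Theta_2 = \theta]$    (4-10)

for all $\underline{\eta}$. □

For example, let $\theta$ be mathematical ability and $\underline{\eta} = \eta$ be verbal ability. Then (4-9) says that, among all
examinees of mathematical ability $\theta$, Group 2 examinees are stochastically verbally superior
to Group 1 examinees.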

With  the  above  preparation,  potential  for  test  bias  can  be  defined. 

Definition 4.3. Let two groups have ability distributions $(\Theta_1, \underline{\eta}_1)$ and $(\Theta_2, \underline{\eta}_2)$. Potential for test
bias exists with respect to nuisance determinant $\underline{\eta}$ at target ability level $\theta$ if either (4-9) or (4-10)
holds. If (4-9) holds a potential disadvantage exists against Group 1 at target ability $\theta$. □

Definition 4.3 implies that a potential disadvantage can exist only if there is a nuisance
determinant as a component of the latent ability vector.

4.3.2  Expression  of  test  bias  potential

In  order  for  test  bias  to  occur,  its  potential  must  be  expressed  in  one  or  more  items.  The  concept  of 
expressed  bias,  detailed  in  Definition  4.5  below,  is  similar  to  the  item  bias  concept  of  Definition  4.1. 
It  is  stated  in  terms  of  the  marginal  IRFs  with  respect  to  target  ability: 

Definition 4.4. Refer to (4-7). The marginal IRF

    $T_{ig}(\theta) = E[P_i(\Theta_g, \underline{\eta}_g) \mid \Theta_g = \theta]$
    $\qquad\; = P[U_i = 1 \mid \Theta_g = \theta], \quad i = 1, \ldots, N$

is called the target marginal IRF for item $i$, Group $g$. □

We can now define expressed bias in item $i$ at target ability $\theta$.

Definition 4.5. Let $\{T_{ig}(\theta) : i = 1, \ldots, N\}$ be Group $g$'s target marginal IRFs for a test with IRT
representation (4-7).

(i) Expressed bias in item $i$ exists at target ability $\theta$ if item $i$'s target marginal IRF for Group 1
is not equal to the corresponding target marginal IRF for Group 2 at $\theta$:

    $T_{i1}(\theta) \neq T_{i2}(\theta).$

(ii) Expressed bias in item $i$ exists if there is some value $\theta$ for which expressed bias for item $i$ exists
at $\theta$.

Item $i$ is biased against Group 1 at $\theta$ if $T_{i1}(\theta) < T_{i2}(\theta)$. □

Definition 4.5 (our multidimensional IRT expressed item bias definition) is equivalent to
Definition 4.1 (the usual IRT item bias definition) if

(i) the IRT models represented by (4-2) and (4-7) are both IRT representations of $\{\underline{U}_g : g = 1, 2\}$,

(ii) the ability $\theta$ of (4-2) is the target ability $\theta$ of Definition 4.2, and

(iii) the group-dependent IRF $P_{ig}(\cdot)$ from (4-2) is taken to be the target marginal IRF $T_{ig}(\cdot)$ from
Definition 4.4.

Henceforth in the paper, “item bias” will refer specifically to the expressed item bias of
Definition 4.5.

The  link  between  potential  for  bias  and  expressed  bias  for  an  item  is  the  heart  of  test  bias.  The 
following theorem is fundamental in establishing this link.

Theorem 4.1. Assume IRT representation (4-7) and fix the number $\theta$. If $P_i(\theta, \underline{\eta})$ is strictly
increasing in $\underline{\eta}$ and a potential disadvantage exists against Group 1 at $\theta$ then item $i$ is biased
against Group 1 in the sense of Definition 4.5.

Proof.  The  result  is  an  immediate  corollary  of  Theorem  3.1. 

Remark.  In  a  sense,  Theorem  4.1  formalizes  the  obvious;  dependence  of  an  item  on  nuisance 
determinants  with  respect  to  which  one  group  is  disadvantaged  causes  expressed  item  bias. 
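
Theorem 4.1 can be illustrated numerically. The sketch below is not part of the paper's development: it assumes a hypothetical two-dimensional logistic IRF (`irf`) and normal conditional nuisance distributions whose means differ by group, and it approximates the target marginal IRF of Definition 4.4 with a midpoint rule. All function names and parameter values are illustrative assumptions.

```python
import math

def irf(theta, eta, a=1.0, b=0.5, c=0.0):
    """Hypothetical logistic IRF P_i(theta, eta); strictly increasing in eta since b > 0."""
    return 1.0 / (1.0 + math.exp(-(a * theta + b * eta - c)))

def normal_pdf(x, mean, sd=1.0):
    """Density of the assumed conditional nuisance distribution eta_g | Theta_g = theta."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def target_marginal_irf(theta, eta_mean, lo=-8.0, hi=8.0, steps=4000):
    """Midpoint-rule approximation of T_ig(theta) = E[P_i(theta, eta_g) | Theta_g = theta]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        eta = lo + (k + 0.5) * width
        total += irf(theta, eta) * normal_pdf(eta, eta_mean) * width
    return total

# Assumed conditional means: Group 1 at -0.5, Group 2 at +0.5, so (4-9) holds
# (a potential disadvantage against Group 1) at every theta considered.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
gaps = [target_marginal_irf(t, 0.5) - target_marginal_irf(t, -0.5) for t in thetas]

for t, gap in zip(thetas, gaps):
    print(f"theta={t:+.1f}  T_i2 - T_i1 = {gap:.4f}")
```

Under these assumptions every gap is positive, i.e. the item is biased against Group 1 at each listed value of $\theta$ in the sense of Definition 4.5.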

4.3.3  Transmission  of  expressed  item  biases  into  test  bias 

Until  now  the  discussion  has  focused  on  a  single  item;  we  shall  see  that  a  test  can  consist  of  many 
items  simultaneously  biased  by  the  same  nuisance  determinant.  In  this  case,  items  can  cohere  and 
act  through  the  prescribed  test  score  to  produce  substantial  bias  against  a  particular  group  even  if 
individual  items  display  undetectably  small  amounts  of  item  bias. 

This  is  the  final  component  of  our  formulation  of  test  bias  mentioned  at  the  beginning  of  this 
section.  We  consider  the  large  class  of  test  scores  of  the  form 

    $h(\underline{U})$    (4-11)

where $h(\underline{u})$ is real valued with domain all $\underline{u} = (u_1, \ldots, u_N)$ such that $u_i = 0$ or 1 for $i = 1, \ldots, N$
and $h(\underline{u})$ is coordinatewise non-decreasing in $\underline{u}$. This class contains many of the standard scoring
procedures for many standard models; for example, number correct, linear formula scoring of the
form $\sum_{i=1}^{N} a_i U_i$ with $a_i > 0$, maximum likelihood estimation of ability for certain logistic models
with item parameters assumed known, etc. One is surely willing to restrict attention to test scores
of the form (4-11), if the test's IRFs are known to be increasing. Following Rosenbaum (1985), test
scores of the form (4-11) will be called non-decreasing item summaries.
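
For a short test, membership in the class (4-11) can be verified by brute force. The sketch below is an illustration under stated assumptions: the function name `is_nondecreasing_item_summary` and the three example scoring rules are hypothetical, and the exhaustive enumeration is only practical for small $N$.

```python
from itertools import product

def is_nondecreasing_item_summary(h, n_items):
    """Return True if h(u) never decreases when a single 0 in u is changed to 1.

    Exhaustive over all 2**n_items response patterns, so intended for small tests only.
    """
    for u in product((0, 1), repeat=n_items):
        for i in range(n_items):
            if u[i] == 0:
                flipped = u[:i] + (1,) + u[i + 1:]
                if h(flipped) < h(u):
                    return False
    return True

def number_correct(u):
    return sum(u)

def weighted_score(u, weights=(0.5, 1.0, 2.0, 0.0)):
    # Linear scoring with nonnegative (illustrative) weights.
    return sum(a * x for a, x in zip(weights, u))

def parity_score(u):
    # Deliberately non-monotone rule, included as a counterexample.
    return sum(u) % 2

print(is_nondecreasing_item_summary(number_correct, 4))
print(is_nondecreasing_item_summary(weighted_score, 4))
print(is_nondecreasing_item_summary(parity_score, 4))
```

Number correct and the nonnegatively weighted score pass the check, while the parity rule does not and so falls outside the class of non-decreasing item summaries.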

Test  bias  is  defined  with  respect  to  a  specific  test  scoring  method  h(u). 

Definition 4.6. A test $\underline{U}$ with target ability $\theta$ and test score $h(\underline{U})$ displays test bias against
Group 1 at $\theta$ if

    $E[h(\underline{U}_1) \mid \Theta_1 = \theta] < E[h(\underline{U}_2) \mid \Theta_2 = \theta].$    (4-12)

If

    $E[h(\underline{U}_1) \mid \Theta_1 = \theta] = E[h(\underline{U}_2) \mid \Theta_2 = \theta]$    (4-13)

then no test bias exists at $\theta$. □

The psychometric interpretation of Definition 4.6 is as follows. The left side of (4-13) is the
expected test score for a randomly chosen Group 1 examinee with target ability $\theta$ while the right
hand side is the same for a randomly chosen Group 2 examinee with target ability $\theta$. In order to
assess the appropriateness of Definition 4.6, consider a large number of Group 1 and a large number
of Group 2 examinees taking the test, all of target ability $\theta$. Then (4-13) says that the average
score of these Group 1 examinees will be approximately the same as that of the Group 2 examinees.
Thus, on average, neither group is favored in the attempt to estimate target ability using $h(\underline{U}_g)$.

4.3.4  A  fundamental  relationship 

We  now  elucidate  how  the  three  conceptual  components  of  our  formulation  interact  to  produce  test 
bias. For ease of interpretation we restrict ourselves to the case of a unidimensional $\eta$; however,
the following results hold if a vector-valued nuisance determinant $\underline{\eta}$ is assumed.

The  basic  test  bias  result  is  given  in  Theorem  4.2,  namely  the  precise  mechanism  by  which 
potential  for  bias  is  transmitted  into  test  bias.  First  a  variation  of  a  well-known  lemma  is  needed, 
which  for  convenience  is  specialized  to  the  present  setting. 

Lemma 4.1. Let $f(\eta)$ be strictly increasing in $\eta$ and let stochastic ordering in the sense of (4-9)
hold for each fixed $\theta$. Then for each fixed $\theta$

    $E[f(\eta_1) \mid \Theta_1 = \theta] < E[f(\eta_2) \mid \Theta_2 = \theta].$

Proof. Fix $\theta$ and let $F_g(\cdot)$ denote the cdf of $f(\eta_g) \mid \Theta_g = \theta$. Assume, for simplicity of argument
and without loss of generality, that $F_g(0) = 0$ for $g = 1, 2$. Then

    $E[f(\eta_g) \mid \Theta_g = \theta] = \int_0^\infty x \, dF_g(x).$

Integration by parts yields

    $\int_0^\infty x \, dF_g(x) = \int_0^\infty (1 - F_g(x)) \, dx.$    (4-14)

But (4-9) and $f(\eta)$ strictly increasing imply that

    $F_1(x) > F_2(x)$ for all $x > 0$.

Using (4-14) and noting that

    $\int_0^\infty (1 - F_1(x)) \, dx < \int_0^\infty (1 - F_2(x)) \, dx,$

the desired result follows. □
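
The identity (4-14) can also be obtained by interchanging the order of integration rather than integrating by parts. The derivation below is a standard argument written for a generic nonnegative random variable with cdf $F$ and finite mean; the symbols $F$ and $t$ are introduced only for this illustration.

```latex
\int_0^\infty x \, dF(x)
  = \int_0^\infty \left( \int_0^x dt \right) dF(x)
  = \int_0^\infty \left( \int_{(t,\infty)} dF(x) \right) dt
  = \int_0^\infty \bigl( 1 - F(t) \bigr) \, dt .
```

Applied with $F = F_1$ and $F = F_2$, the pointwise inequality $F_1(x) > F_2(x)$ for $x > 0$ places the integrand $1 - F_1$ strictly below $1 - F_2$, which is what yields the strict inequality between the two expectations.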

The theory of associated random variables is helpful in establishing the basic test bias result.
As defined by Esary, Proschan, and Walkup (1967), a random vector $\underline{X}$ is associated if, and only
if, for all nondecreasing $f(\underline{x})$, $g(\underline{x})$, it follows that

    $\mathrm{cov}(f(\underline{X}), g(\underline{X})) \geq 0.$    (4-15)

The main result of Esary, Proschan, and Walkup (1967) that we wish to use is that a vector of
independent random variables is associated. The basic result can now be stated and proved.
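
Before turning to the theorem, the association property (4-15) for independent coordinates can be checked exactly on a small example. This is a sketch under illustrative assumptions: the helper `covariance`, the success probabilities, and the choice of nondecreasing functions are all hypothetical and chosen only to make the enumeration small.

```python
from itertools import product

def covariance(f, g, probs):
    """Exact cov(f(X), g(X)) when X has independent Bernoulli(probs[i]) coordinates."""
    e_f = e_g = e_fg = 0.0
    for x in product((0, 1), repeat=len(probs)):
        # Probability of pattern x under independence.
        weight = 1.0
        for xi, p in zip(x, probs):
            weight *= p if xi == 1 else 1.0 - p
        e_f += weight * f(x)
        e_g += weight * g(x)
        e_fg += weight * f(x) * g(x)
    return e_fg - e_f * e_g

probs = (0.2, 0.5, 0.7)  # illustrative item success probabilities

# sum, max, and the first coordinate are all coordinatewise nondecreasing.
cov_sum_max = covariance(sum, max, probs)
cov_sum_first = covariance(sum, lambda x: x[0], probs)

print(cov_sum_max, cov_sum_first)
```

Both covariances are nonnegative, as association requires; conditional on $(\theta, \eta)$, item responses are independent Bernoulli variables by local independence, which is the setting in which the property is used.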

Theorem 4.2. Assume IRT representation (4-7) with $\underline{\eta} = \eta$ being unidimensional. Fix the number
$\theta$ and assume the test scoring method of the form (4-11). Suppose for some $i$ that $h(\underline{u})$ is strictly
increasing as $u_i = 0$ increases to $u_i = 1$ and that $P_i(\theta, \eta)$ is strictly increasing in $\eta$. Assume
potential for bias at $\theta$ against Group 1, i.e., that (4-9) holds. Then test bias at $\theta$ against Group 1
holds.

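Proof. It suffices to prove (4-12). By IRF invariance with respect to $(\theta, \eta)$, it follows for all $\eta$
and the fixed $\theta$ that

    $E[h(\underline{U}_1) \mid \Theta_1 = \theta, \eta_1 = \eta] = E[h(\underline{U}_2) \mid \Theta_2 = \theta, \eta_2 = \eta].$    (4-16)

Conditioning on $\Theta_g = \theta$, $\eta_g = \eta$ will be denoted by $\theta, \eta$. Let

    $f(\eta) = E[h(\underline{U}_g) \mid \theta, \eta].$

Note that $f(\eta)$ does not depend on $g$ by (4-16), hence let $\underline{U} = \underline{U}_1$ throughout the remainder of the
proof. We first show that $f(\eta)$ is strictly increasing in $\eta$. Fix $\eta' > \eta$. Then, by local independence,

    $q(\underline{u}) \equiv \frac{P[\underline{U} = \underline{u} \mid \theta, \eta']}{P[\underline{U} = \underline{u} \mid \theta, \eta]} = \frac{\prod_{j=1}^{N} P_j(\theta, \eta')^{u_j} (1 - P_j(\theta, \eta'))^{1 - u_j}}{\prod_{j=1}^{N} P_j(\theta, \eta)^{u_j} (1 - P_j(\theta, \eta))^{1 - u_j}}.$

Thus, $q(\underline{u})$ is strictly increasing as $u_i = 0$ increases to $u_i = 1$ because $P_i(\theta, \eta')/P_i(\theta, \eta) > 1$. Now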

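    $E(h(\underline{U}) \mid \theta, \eta') - E(h(\underline{U}) \mid \theta, \eta) = \sum_{\underline{u}} h(\underline{u}) \left( P[\underline{U} = \underline{u} \mid \theta, \eta'] - P[\underline{U} = \underline{u} \mid \theta, \eta] \right)$
    $\qquad = \sum_{\underline{u}} h(\underline{u}) [q(\underline{u}) - 1] P[\underline{U} = \underline{u} \mid \theta, \eta]$    (4-17)
    $\qquad = \mathrm{cov}(h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta).$

The last equality holds because $E[q(\underline{U}) - 1 \mid \theta, \eta] = 0$. Partition

    $\underline{U} = (U_1, \ldots, U_N) = (\underline{U}', U_i)$

where

    $\underline{U}' = (U_1, \ldots, U_{i-1}, U_{i+1}, \ldots, U_N).$

Let $E_{\underline{W}}$ and $\mathrm{cov}_{\underline{W}}$ denote expectation and covariance over the distribution of $\underline{W}$, respectively. By
a basic identity for covariance, stated here conditional on $(\theta, \eta)$,

    $\mathrm{cov}(h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta) = E_{\underline{U}'}\{\mathrm{cov}_{U_i}[h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta] \mid \theta, \eta\}$
    $\qquad + \mathrm{cov}_{\underline{U}'}\{E_{U_i}(h(\underline{U}) \mid \theta, \eta), E_{U_i}(q(\underline{U}) - 1 \mid \theta, \eta) \mid \theta, \eta\}.$    (4-18)

Both $h(\underline{u})$ and $q(\underline{u}) - 1$ are strictly increasing as $u_i = 0$ increases to $u_i = 1$. Thus, for all possible
values of $\underline{U}'$,

    $\mathrm{cov}_{U_i}[h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta] > 0.$

Thus, the first term on the right hand side of (4-18) is strictly positive. Because of the association
of independent random variables and the fact that $\underline{U}'$ given $\theta, \eta$ has independent components, it
follows that the second term on the right hand side of (4-18) is nonnegative, using also the fact
that

    $E_{U_i}(h(\underline{U}) \mid \theta, \eta)$ and $E_{U_i}(q(\underline{U}) - 1 \mid \theta, \eta)$

are nondecreasing in $\underline{U}'$. Thus,

    $\mathrm{cov}(h(\underline{U}), q(\underline{U}) - 1 \mid \theta, \eta) > 0.$

But, recalling (4-17),

    $E(h(\underline{U}) \mid \theta, \eta') - E(h(\underline{U}) \mid \theta, \eta) > 0;$

that is, $f(\eta)$ is strictly increasing in $\eta$, as claimed. Then, applying Lemma 4.1 and (4-9) to $f(\eta)$
above, it follows that (4-12) holds, as required. □

Remarks.

(i) It is important to reemphasize that Theorem 4.2 holds if a vector valued nuisance parameter
$\underline{\eta}$ is assumed, provided (4-9), the potential for bias at $\theta$, holds for $\underline{\eta}$. That is, the nuisance
determinants $\eta_1, \ldots, \eta_{d-1}$ must each create bias in the same direction, say against Group 1.

(ii) Stripped of its test bias context and stated as a general theorem about IRT models, a minor
variant of Theorem 4.2 with $<$ replaced by $\leq$ at appropriate places is due to Rosenbaum (1985).
For our purposes, strict inequality is needed. The proof of Rosenbaum's result
is similar to our proof.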
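
The way Theorem 4.2 lets many small item biases accumulate can be seen in a numerical sketch for the number-correct score, whose conditional expectation is the sum of the target marginal IRFs. Everything specific here is an illustrative assumption rather than the paper's model: the logistic IRF, the small nuisance weight of 0.2, the 40 evenly spaced difficulties, and the normal conditional nuisance distributions with means -0.25 (Group 1) and +0.25 (Group 2).

```python
import math

def irf(theta, eta, difficulty, nuisance_weight=0.2):
    """Hypothetical logistic IRF that depends only weakly on the nuisance ability eta."""
    return 1.0 / (1.0 + math.exp(-(theta + nuisance_weight * eta - difficulty)))

def marginal_irf(theta, difficulty, eta_mean, lo=-8.0, hi=8.0, steps=2000):
    """Midpoint-rule approximation of E[P_i(theta, eta)] with eta ~ N(eta_mean, 1) (assumed)."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        eta = lo + (k + 0.5) * width
        density = math.exp(-0.5 * (eta - eta_mean) ** 2) / math.sqrt(2.0 * math.pi)
        total += irf(theta, eta, difficulty) * density * width
    return total

n_items = 40
difficulties = [-1.0 + 2.0 * i / (n_items - 1) for i in range(n_items)]
theta = 0.0

# Per-item expressed bias T_i2(theta) - T_i1(theta); (4-9) holds by construction.
item_gaps = [
    marginal_irf(theta, c, 0.25) - marginal_irf(theta, c, -0.25) for c in difficulties
]

# For number correct, E[h(U_2) | theta] - E[h(U_1) | theta] is the sum of the item gaps.
test_gap = sum(item_gaps)

print(f"largest item gap: {max(item_gaps):.4f}")
print(f"expected number-correct gap: {test_gap:.3f}")
```

Under these assumptions each item's gap in success probability stays below 0.03, yet the expected number-correct score of Group 1 examinees at this $\theta$ falls short of that of Group 2 examinees by more than half an item.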

A  final  interesting  relationship  to  note  is  that  the  presence  of  test  bias  implies  the  potential  for 
test  bias. 

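Theorem 4.3. Suppose that test bias against Group 1 holds at $\theta$ in the sense of (4-12). Then the
potential for bias against Group 1 at $\theta$ exists in the sense that (4-9) holds.

Proof. Recall (4-16), replacing $\eta$ by $\underline{\eta}$ there. Thus for $g = 1, 2$, it holds that

    $E[h(\underline{U}_g) \mid \Theta_g = \theta] = \int E[h(\underline{U}_1) \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] \, dF_g(\underline{\eta} \mid \theta)$    (4-19)

where $F_g(\underline{\eta} \mid \theta)$ is the cdf of $\underline{\eta}_g \mid \Theta_g = \theta$. Suppose (4-12). Thus, using (4-19) for $g = 1, 2$ it follows
that

    $\int E[h(\underline{U}_1) \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] \, dF_1(\underline{\eta} \mid \theta) < \int E[h(\underline{U}_1) \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] \, dF_2(\underline{\eta} \mid \theta).$

But this implies that the distributions of $\underline{\eta}_1 \mid \Theta_1 = \theta$ and $\underline{\eta}_2 \mid \Theta_2 = \theta$ are different. Thus, invoking
Assumption 4.2, it follows that (4-9) holds. □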

4.4  Item  Bias  Cancellation 

As discussed above, and epitomized by Theorem 4.2, items can combine to amplify bias at the test
level. In contrast, items displaying bias can also tend to cancel each other out, thus producing
little or no bias at the test level. This becomes possible only when the nuisance determinant $\underline{\eta}$ is
multidimensional with some of its components displaying potential for bias against Group 1 and
others displaying potential for bias against Group 2. The amount of expressed test bias will be
a result of the amount of cancellation at the test level and will be dependent on the particular
test score $h(\underline{u})$ used. The theme of cancellation has been presented by Humphreys (1986) and
Roznowski (1987) in the non-IRT classical predictive validity context.

The  following  example  illustrates  how  cancellation  can  function  to  produce  negligible  test  bias. 

Example 4.1. A test of length $N$ ($N$ an even number for convenience) intended to measure
calculation skills has IRT representation

    $\{d = 3, (\Theta_g, \eta_{1g}, \eta_{2g}), F_g(\theta, \eta_1, \eta_2), \{P_i(\theta, \eta_1, \eta_2) : i = 1, \ldots, N\} : g = 1, 2\}$

where $\theta$ = mathematics skills, $\eta_1$ = physics knowledge, and $\eta_2$ = reading knowledge. Let $S_1$ be a
subtest with IRFs

    $\{P_i(\theta, \eta_1) : i = 1, \ldots, N/2\}$

(subtest containing problems with a mathematical physics flavor) strictly increasing in $\eta_1$ for every
$\theta$ and $S_2$ be a subtest with IRFs

    $\{P_i(\theta, \eta_2) : i = N/2 + 1, \ldots, N\}$

(subtest containing mathematical “word problems”) strictly increasing in $\eta_2$ for every $\theta$. Suppose
that the ith physics IRF is identical to the ith word problem IRF, which is the $(N/2 + i)$th item.

Now, condition on a particular mathematics ability $\theta$, and assume for examinees of ability $\theta$ that
Group 2 has greater knowledge of physics and Group 1 has greater reading skill. So $\eta_{12} \mid \theta > \eta_{11} \mid \theta$
stochastically and $\eta_{21} \mid \theta > \eta_{22} \mid \theta$ stochastically. Say that this holds for each
choice of $\theta$. Furthermore suppose that as distributions, $\eta_{12} \mid \theta = \eta_{21} \mid \theta$ and $\eta_{11} \mid \theta = \eta_{22} \mid \theta$ for
all $\theta$. Then by Theorem 4.2, if subtest $S_1$ were the entire test, it would exhibit test bias against
Group 1 at $\theta$ for every $\theta$. By contrast, if $S_2$ were the entire test, it would exhibit test bias against
Group 2 at $\theta$ for every $\theta$. But, for a large class of test scores (those giving approximately equal
weight to the $S_1$ items and to the $S_2$ items), almost total cancellation of the item biases could occur,
thus producing an unbiased test. That is, for such a test scoring method $h(\underline{u})$,

    $E[h(\underline{U}_1) \mid \Theta_1 = \theta] \approx E[h(\underline{U}_2) \mid \Theta_2 = \theta]$

for every $\theta$. Indeed if $h(\underline{u})$ is number correct, then exact equality and hence total cancellation
results.

Remark. Note that the concept of test bias compares groups, not individuals. For a particular
examinee, a test might be biased against her, even though the test is not biased against Group 1 of
which she is a member. This important aspect of bias is an unfortunate consequence of the
multidimensional nature of items in most tests. Moreover, it is also a consequence of the unfortunate (and
perhaps economically unavoidable) fact that only statistical (i.e., group-level) bias analysis is done,
as opposed to individual case-by-case analysis. The above discussed phenomenon of cancellation
could possibly alleviate the impact at the individual examinee level (as well, as just discussed, as
at group level).
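
The cancellation described in Example 4.1 can be reproduced numerically for the number-correct score. The sketch below is only an illustration under assumptions that are not in the paper: a logistic IRF with nuisance weight 0.8, five item difficulties shared by the ith physics item and the ith word problem, and normal conditional nuisance distributions whose means are mirrored across the two groups.

```python
import math

def marginal_irf(theta, difficulty, eta_mean, weight=0.8, steps=2000):
    """E[P(theta, eta)] for eta ~ N(eta_mean, 1), by the midpoint rule (illustrative model)."""
    lo, hi = -8.0, 8.0
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        eta = lo + (k + 0.5) * width
        p = 1.0 / (1.0 + math.exp(-(theta + weight * eta - difficulty)))
        density = math.exp(-0.5 * (eta - eta_mean) ** 2) / math.sqrt(2.0 * math.pi)
        total += p * density * width
    return total

theta = 0.0
# Difficulties shared by the i-th physics item (S1) and the i-th word problem (S2).
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Assumed conditional nuisance means given theta, as (physics, reading):
# Group 2 knows more physics, Group 1 reads better, with mirrored distributions.
means = {1: (-0.5, 0.5), 2: (0.5, -0.5)}

def expected_subscores(group):
    """Expected number correct on S1 (physics items) and S2 (word problems) at theta."""
    physics_mean, reading_mean = means[group]
    s1 = sum(marginal_irf(theta, c, physics_mean) for c in difficulties)
    s2 = sum(marginal_irf(theta, c, reading_mean) for c in difficulties)
    return s1, s2

s1_g1, s2_g1 = expected_subscores(1)
s1_g2, s2_g2 = expected_subscores(2)
total_gap = (s1_g2 + s2_g2) - (s1_g1 + s2_g1)

print(f"S1 gap (Group 2 - Group 1): {s1_g2 - s1_g1:+.4f}")
print(f"S2 gap (Group 2 - Group 1): {s2_g2 - s2_g1:+.4f}")
print(f"full-test gap: {total_gap:+.4f}")
```

In this construction $S_1$ alone is biased against Group 1 and $S_2$ alone against Group 2, while the full-test number-correct gap vanishes because the two subtest gaps are equal in size and opposite in sign.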

It  is  worthwhile  to  develop  item  bias  cancellation  in  a  formal  manner. 

Definition 4.7. Item bias cancellation at $\theta$ is said to occur if the test consists both of items biased
against Group 1 at $\theta$ and items biased against Group 2 at $\theta$.

Remark.  It  is  theoretically  possible  that  cancellation  could  occur  within  an  item  if  the  item 
depends  on  at  least  two  nuisance  dimensions,  as  contrasted  with  the  between  item  cancellation 
of  Definition  4.7.  This  source  of  cancellation,  which  seems  less  likely  to  occur  in  practice,  is  not 
considered  in  this  paper. 

Intuitively, the presence of expressed item bias and no cancellation implies test bias. This is
the content of Theorem 4.4.

Theorem 4.4. Assume that at least one item displays expressed item bias at $\theta$ in the sense of
Definition 4.5, and assume that no item bias cancellation occurs at $\theta$. Then test bias occurs at $\theta$ in
all non-decreasing item summary test scores $h(\underline{u})$ (see (4-11)) provided $h(\underline{u})$ is strictly increasing
in at least one coordinate corresponding to one of the biased items.

Proof. At the item level, each item is either biased only against one group (Group 1, say) or
displays no expressed bias by the assumption of no cancellation. Thus, for all $i$,

    $P[U_{i1} = 1 \mid \Theta_1 = \theta] \leq P[U_{i2} = 1 \mid \Theta_2 = \theta]$    (4-20)

with strict inequality for at least one $i$. Now, by item invariance, for all $i$,

    $P[U_{i1} = 1 \mid \Theta_1 = \theta, \underline{\eta}_1 = \underline{\eta}] = P[U_{i2} = 1 \mid \Theta_2 = \theta, \underline{\eta}_2 = \underline{\eta}] = P_i(\theta, \underline{\eta}).$

Recall Assumption 4.2. Note that, denoting the cdf of $(\underline{\eta}_g \mid \Theta_g = \theta)$ by $F_g(\underline{\eta} \mid \theta)$,

    $P[U_{ig} = 1 \mid \Theta_g = \theta] = \int P_i(\theta, \underline{\eta}) \, dF_g(\underline{\eta} \mid \theta)$

where the integrand does not depend on $g$. It follows from Assumption 4.2 that strict inequality in
(4-20) for some $i$ implies that $(\underline{\eta}_1 \mid \Theta_1 = \theta) < (\underline{\eta}_2 \mid \Theta_2 = \theta)$ stochastically. Thus, using the monotone
condition for $h(\underline{u})$, the conclusion follows from Theorem 4.2, noting the remark following the proof
of Theorem 4.2 concerning multiple nuisance determinants. □

It is interesting to note, as Theorem 4.5 now states, that when there is no item bias cancellation,
test bias for number correct is equivalent to test bias for all nondecreasing item summary test
scores with strict increase for at least one coordinate of $\underline{u}$.

Theorem 4.5. (a) If test bias at $\theta$ occurs for the test score number correct ($\sum_{i=1}^{N} u_i$) and there is
no item bias cancellation at $\theta$, then test bias occurs at $\theta$ for every nondecreasing item summary test
score $h(\underline{u})$ for which $h(\underline{u})$ is strictly increasing in at least one coordinate of $\underline{u}$. (b) If test bias at $\theta$
holds for some nondecreasing item summary test score $h(\underline{u})$ and there is no item bias cancellation
at $\theta$, then test bias at $\theta$ holds for $h(\underline{u}) = \sum_{i=1}^{N} u_i$.


Proof. Note that

    $E\left[ \sum_{i=1}^{N} U_{ig} \mid \Theta_g = \theta \right] = \sum_{i=1}^{N} \int P_i(\theta, \underline{\eta}) \, dF_g(\underline{\eta} \mid \theta).$

Then, obvious and minor modifications in the proof of Theorem 4.4 suffice to prove both (a) and
(b). Details are omitted. □

Intuitively,  no  test  bias  and  no  cancellation  implies  that  none  of  the  items  display  bias.  This  is 
the  content  of  Theorem  4.6. 


Theorem 4.6. Assume that no test bias exists at $\theta$ with respect to score $h(\underline{u})$. Assume no item
bias cancellation at $\theta$ in the sense of Definition 4.7. In addition, assume that there exists at least
one $i$ such that both $P_i(\theta, \underline{\eta})$ is strictly increasing in $\underline{\eta}$ and $h(\underline{u})$ is strictly increasing as $u_i = 0$
increases to $u_i = 1$. Then there is no potential for test bias and (hence) none of the items display
item bias.

Proof. By assumption of no test bias at $\theta$,

    $E[h(\underline{U}_1) \mid \Theta_1 = \theta] = E[h(\underline{U}_2) \mid \Theta_2 = \theta].$    (4-21)

By the strict increasing assumption for $P_i(\theta, \underline{\eta})$ and $h(\underline{u})$, it follows that $E[h(\underline{U}_g) \mid \theta, \underline{\eta}]$ is strictly
increasing in $\underline{\eta}$. Recall (4-19). If either (4-9) or (4-10) were to hold, it would thus be impossible for
(4-21) to hold. Thus by regularity Assumption 4.2, it follows that $(\underline{\eta}_1 \mid \Theta_1 = \theta) = (\underline{\eta}_2 \mid \Theta_2 = \theta)$
stochastically; i.e., there is no potential for test bias. Referring to Theorem 4.1, we see that none
of the items display item bias. □

Remarks. 

(i) Assuming a scoring method really dependent on all items and that at least one of the items
actually depends on $\underline{\eta}$, Theorem 4.6 implies that if there is potential for bias, then either test
bias results or item bias cancellation results (and possibly both result simultaneously).

(ii) Theorems 4.2 and 4.6 can together be interpreted as stating a set of conditions under which
the potential for test bias is equivalent to test bias.

4.5  Valid  Subtest 

Recall the informal definition of a valid subtest from Section 2. As mentioned therein, the reason
for requiring a valid subtest to exist is that it is statistically impossible to detect test bias using
only data from an ability test unless there exists an internal criterion measuring only the target
ability; i.e., a valid subtest. Here we formally define the validity of a subtest. Let $\theta$ denote the
target ability. Recall from Section 4.2 that all IRFs are assumed strictly increasing in $\theta$.

Definition 4.8. Let $\underline{U}$ be a test response with IRT representation (3-7), let $\underline{\theta} = (\theta, \underline{\eta})$, and let $S$
be a subset of the items $1, \ldots, N$. $S$ is a valid subtest if the IRFs of all items in $S$ depend only on
$\theta$; i.e., $P_i(\theta, \underline{\eta}) = P_i(\theta)$ for each $i$ in $S$.

Remarks. 

(i)  From  a  practical  viewpoint  one  wants  S  to  consist  of  as  many  of  the  items  of  the  test  as 
possible;  the  statistical  power  of  detecting  test  bias  increases  as  the  proportion  of  valid  items 
does. 
(ii) Consider a specified nondecreasing item summary scoring method $h(\underline{u})$ for a test response $\underline{U}$ (recall (4-11)). Suitably restrict this scoring method to a subtest response $\underline{U}'$, denoting it by $h'(\underline{u}')$. For example, if $h(\underline{u}) = \sum_{i=1}^{N} u_i/N$, then $h'(\underline{u}') = \sum' u_i/N'$ is the obvious “restriction”, where $N'$ is the cardinality of $\underline{U}'$ and $\sum'$ denotes summation over the components of $\underline{U}'$. A plausible alternative definition of subtest validity, consistent with this paper’s emphasis on the expression of bias at the test level through the test score, would be to require of $h'(\underline{u}')$ that, for all $\theta$,

$$E[h'(\underline{U}') \mid (\Theta, \underline{\eta}) = (\theta, \eta)]$$

depends only on $\theta$ and not on $\eta$. This assertion is equivalent to asserting for all $(\theta, \eta)$ that

$$E[h'(\underline{U}') \mid (\Theta, \underline{\eta}) = (\theta, \eta)] = E[h'(\underline{U}') \mid \Theta = \theta]. \qquad \text{(4-22)}$$

(4-22)  is  appealing  as  a  possible  definition  of  subtest  validity  because  it  functions  in  an 
aggregate  way  at  the  test  level  based  on  the  specified  test  scoring  method  “restricted”  to  the 
subtest. Evoking the usual empirical interpretation of expectation, (4-22) says that repeated sampling of examinees from two ability groups, both with the same value of $\theta$ but with any two different values of $\eta$, produces on average approximately the same value of $h'(\underline{U}')$, as one would wish a “valid subtest” to do.

Fortunately, however, this alternate and appealing definition is actually equivalent to our Definition 4.8, under the natural and mild regularity condition that $h'(\underline{u}')$ be strictly increasing as $u_i = 0$ is increased to $u_i = 1$ for each component $u_i$ of $\underline{u}'$; that is, $h'(\underline{u}')$ must really depend on each of the valid subtest item responses. This assertion follows from a modification of the proof of Theorem 4.2. Thus our definition of subtest validity can be thought of as operating either at the item level (Definition 4.8) or at the test level ((4-22)).

(iii) Assume a two group representation (4-3). It is perhaps interesting to note that it is possible for all $\theta$ that

$$E[h'(\underline{U}'_1) \mid \Theta_1 = \theta] = E[h'(\underline{U}'_2) \mid \Theta_2 = \theta] \qquad \text{(4-23)}$$

and yet subtest validity not hold. Note here that (4-22), equivalent to subtest validity, implies (4-23); however, (4-23) should not be used as a definition of subtest validity. As an extreme example demonstrating this claim, each item of $S$ could be measuring $\eta$ alone with $\underline{\eta}_g$ independent of $\Theta_g$ for $g = 1$ and $g = 2$ and $\underline{\eta}_g$ having the same distribution for $g = 1, 2$. Subtest validity obviously does not hold here because the supposed-to-be valid items may be heavily influenced by $\eta$; however,

$$\begin{aligned}
E[h'(\underline{U}'_1) \mid \Theta_1 = \theta]
&= E\big[E\{h'(\underline{U}'_1) \mid \Theta_1 = \theta, \underline{\eta}_1\} \mid \Theta_1 = \theta\big] \\
&= E\big[E\{h'(\underline{U}'_1) \mid \underline{\eta}_1\} \mid \Theta_1 = \theta\big] \\
&= E\,h'(\underline{U}'_1) \\
&= E\,h'(\underline{U}'_2) = \cdots \\
&= E[h'(\underline{U}'_2) \mid \Theta_2 = \theta];
\end{aligned} \qquad \text{(4-24)}$$

so (4-23) does hold here. The point we have just shown is that the absence of test bias (i.e., that (4-23) holds) does not preclude test invalidity (i.e., that (4-22) fails). Related to this fact, note that test validity for the entire test, in the sense that (4-22) holds for all $(\theta, \eta)$ for some scoring method $h(\underline{u})$ that is increasing in every component $u_i$ of $\underline{u}$, does imply for every $\theta$ that no test bias exists. This follows trivially from the fact that test validity for the entire test means that every item depends only on $\theta$.
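The extreme example in remark (iii) can be checked with exact arithmetic. The sketch below is only an illustration under assumed numbers that do not come from the paper: a two-point nuisance determinant with probability 1/2 at each value in both groups, and subtest items whose response probabilities (1/3 and 2/3) depend on the nuisance value alone. The names `ETA_DIST` and `IRF` are hypothetical.

```python
from fractions import Fraction

# Assumed (hypothetical) nuisance distribution given theta; identical in both
# groups and independent of target ability.
ETA_DIST = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Assumed IRF of every subtest item: it depends on eta alone, not on theta.
IRF = {0: Fraction(1, 3), 1: Fraction(2, 3)}


def expected_score_given_theta_eta(eta):
    """E[h'(U') | Theta = theta, eta] for the proportion-correct subtest score."""
    # All items share the same probability, so their average equals that value.
    return IRF[eta]


def expected_score_given_theta(eta_dist):
    """E[h'(U') | Theta = theta], averaging over the nuisance distribution."""
    return sum(p * expected_score_given_theta_eta(e) for e, p in eta_dist.items())


group1 = expected_score_given_theta(ETA_DIST)
group2 = expected_score_given_theta(ETA_DIST)

# (4-23): the two groups have equal expected subtest scores at theta.
holds_4_23 = group1 == group2
# (4-22): the expected score must not change with eta; here it does.
holds_4_22 = all(expected_score_given_theta_eta(e) == group1 for e in ETA_DIST)

print(group1, group2, holds_4_23, holds_4_22)
```

In this toy setting the group-level expectations agree, so (4-23) holds, while the conditional expectation moves with the nuisance value, so (4-22) fails.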


5  Test  Bias:  The  Long  Test  Case 

The  theory  of  test  bias  presented  in  Section  4  shows  that  if  there  is  at  least  one  nuisance  dimension 
then  test  bias  may  be  present.  It  is  well  known  that  purely  unidimensional  tests  are  rare  among 
typical  aptitude  and  achievement  tests  (see  Ansley  and  Forsyth  (1985),  Humphreys  (1984),  Reckase, 
Carlson,  Ackerman,  and  Spray  (1986),  and  Yen  (1984),  among  others).  The  position  is  summarized 
well  in  Humphreys  (1984): 

The  related  problems  of  dimensionality  and  bias  of  items  are  being  approached  in  an 
arbitrary  and  oversimplified  fashion.  It  should  be  obvious  that  unidimensionality  can 
only  be  approximated.  . . .  The  large  amount  of  unique  variance  in  items  is  not  random 
error,  although  it  can  be  called  error  from  the  point  of  view  of  the  attribute  that  one  is 
attempting  to  measure.  ...We  start  with  the  assumption  that  responses  to  items  have 
many  causes  or  determinants. 

How does the empirical reality of multiple determinants on a test interact with our multidimensional model of test bias? There are two cases to consider: either the test is “long” or it is “short”. By “long” it is meant that the number of items is large enough that asymptotic probabilistic arguments provide a useful approximation to the actual test operating characteristics. For example, for many purposes a test of 40 items can be classified as “long”.

In  the  case  of  a  short  test,  several  of  the  results  in  Section  4  are  important:  First,  even  if 
nuisance  determinants  are  present  in  the  items  and  influence  examinee  performance,  the  potential 
for  bias  against  a  group  must  exist  in  order  for  test  bias  to  be  possible.  Second,  if  the  amount  of 
expressed  bias  at  the  item  level  is  sufficiently  small,  then  the  amount  of  bias  possible  at  the  test 
level  is  bounded  above.  However,  if  little  or  no  cancellation  occurs,  small  amounts  of  bias  at  the 
item  level  can  produce  a  substantial  amount  of  test  bias.  Indeed,  one  can  imagine  a  detrimental 
amount  of  test  bias,  but  with  statistical  testing  for  individual  item  bias  being  unable  to  detect 
any  bias  at  the  item  level.  Third,  the  amount  of  test  bias  is  dependent  upon  the  scoring  method, 
the  scoring  method  being  the  link  between  item  and  test  bias.  It  is  possible  that  some  scoring 
methods  might  be  more  robust  against  the  detrimental  influence  of  item  bias  than  others.  Fourth, 
recalling  Example  4.1  and  the  material  on  item  bias  cancellation,  it  is  quite  possible  to  minimize, 
with  the  help  of  an  aptly  chosen  scoring  method,  the  amount  of  test  bias  by  having  different  biasing 
influences  cancelling  each  other  out.  For  example,  (again  recall  Example  4.1)  if  approximately  equal 
numbers  of  items  express  approximately  equal  amounts  of  bias,  respectively  against  and  in  favor 
of  Group  1,  then  provided  the  scoring  method  gives  approximately  equal  weight  to  the  two  classes 
of  items,  little  or  no  test  bias  should  occur.  Intuitively,  it  seems  likely  that  having  many  minor 
dimensions in addition to $\theta$ might increase the propensity for cancellation and actually result in
less  test  bias.  However,  in  spite  of  certain  encouraging  aspects  of  the  above  remarks,  it  is  surely 
the  fact,  because  of  the  intrinsic  multidimensional  nature  of  ability  tests,  that  serious  amounts  of 
test  bias  are  likely  when  tests  are  short. 

We  now  turn  the  discussion  to  the  development  of  a  “long”  test  scenario.  In  the  study  of  test 
bias  in  a  long  test,  the  theory  of  essential  unidimensionality  of  a  test,  as  developed  by  Stout  (1987, 
1989)  and  refined  by  Junker  (1989a,  b)  turns  out  to  be  useful.  First  we  summarize  the  relevant 
concepts  of  this  theory. 

A “long” test response is conceptualized as being the initial observed segment of a potentially observable infinite item pool $\{U_i, i \ge 1\}$. It is assumed that whatever process has been used to construct the first $N$ items of the pool (i.e., the observed test $\underline{U}_N$) could have been continued in the same manner to produce $\{U_i, i \ge 1\}$. With this understanding, in order to do asymptotic statistical theory and for foundational purposes, we study $\{U_i, i \ge 1\}$ instead of $\underline{U}_N = \{U_i, 1 \le i \le N\}$, conceptualizing the item pool $\{U_i, i \ge 1\}$ as the “test”. A test $\{U_i, i \ge 1\}$ is defined to be essentially unidimensional ($d_E = 1$) if it has an IRT representation with monotone IRFs but, instead of requiring local independence (Assumption 3.1), the weaker assumption is required that

$$\frac{\sum_{1 \le i < j \le N} |\mathrm{Cov}(U_i, U_j \mid \Theta = \theta)|}{\binom{N}{2}} \to 0 \qquad \text{(5-1)}$$

as $N \to \infty$ for every $\theta$. (The requirement of monotonicity can be weakened somewhat when modeling items where non-monotonicity is suspected, but we omit discussion here; see Stout, 1989; Junker, 1989b.) When $d_E = 1$, it is shown that the latent ability is unique in the sense that any other $d_E = 1$ IRT representation has a latent trait that is a monotone rescaling of $\theta$. (E.g., a mathematics test cannot be a test of geography for the reason that there exists no such rescaling.)
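To make condition (5-1) concrete, the following sketch computes the average absolute conditional covariance exactly for a toy item pool. Everything in it is an assumption made for illustration rather than part of the paper: a two-point nuisance determinant with equal probabilities given $\theta$, local independence given $(\theta, \eta)$, "contaminated" items with response probabilities 1/3 and 2/3, and "clean" items with probability 1/2 regardless of the nuisance value.

```python
from fractions import Fraction
from math import comb

# Hypothetical conditional distribution of a binary nuisance given theta.
ETA_DIST = {0: Fraction(1, 2), 1: Fraction(1, 2)}
# Hypothetical IRFs at the fixed theta, indexed by the nuisance value.
CLEAN = {0: Fraction(1, 2), 1: Fraction(1, 2)}         # ignores the nuisance
CONTAMINATED = {0: Fraction(1, 3), 1: Fraction(2, 3)}  # depends on the nuisance


def mean_abs_cond_cov(irfs, eta_dist):
    """Left side of (5-1): average |Cov(U_i, U_j | theta)| over item pairs.

    Under local independence given (theta, eta), for i != j,
    Cov(U_i, U_j | theta) = E[P_i P_j | theta] - E[P_i | theta] E[P_j | theta].
    """
    n = len(irfs)
    means = [sum(p * irf[e] for e, p in eta_dist.items()) for irf in irfs]
    total = Fraction(0)
    for i in range(n):
        for j in range(i + 1, n):
            e_prod = sum(p * irfs[i][e] * irfs[j][e] for e, p in eta_dist.items())
            total += abs(e_prod - means[i] * means[j])
    return total / comb(n, 2)


def build_test(n_items, n_contaminated):
    """First n_contaminated items depend on the nuisance; the rest do not."""
    return [CONTAMINATED] * n_contaminated + [CLEAN] * (n_items - n_contaminated)


few_40 = mean_abs_cond_cov(build_test(40, 5), ETA_DIST)
few_400 = mean_abs_cond_cov(build_test(400, 5), ETA_DIST)
all_40 = mean_abs_cond_cov(build_test(40, 40), ETA_DIST)
print(few_40, few_400, all_40)
```

With a fixed handful of contaminated items the average shrinks as the test grows, which is the behavior (5-1) asks for; when every item is contaminated the average stays at the same positive value for every test length.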

We now must specify a class of scoring methods for the sequence of long tests $\{\underline{U}_N, N \ge 1\}$. It is convenient to consider a large class of such scoring methods, but less extensive than the nondecreasing item summaries (4-11). Recall from mathematical analysis that a collection of functions $\{k_N(x)\}$ is equicontinuous if for every $\epsilon > 0$ there exists $\delta > 0$ such that

$$|k_N(x) - k_N(y)| < \epsilon$$

for all $N$ and all $x, y$ for which $|x - y| < \delta$. Note that the assumed continuity is uniform both in the argument and in the choice of function.

Definition 5.1. $\{k_N(\sum_{i=1}^{N} a_{Ni} U_i)\}$ is called an equicontinuous balanced scoring method provided

(a) $k_N(x)$ is defined on $[0,1]$, is nondecreasing, and satisfies

$$-\infty < \inf_N k_N(0) \le \sup_N k_N(0) < \inf_N k_N(1) \le \sup_N k_N(1) < \infty, \qquad \text{(5-2)}$$

(b) $\{k_N(x)\}$ is equicontinuous, and

(c) $\{a_{Ni} : 1 \le i \le N, N \ge 1\}$ satisfies $0 \le a_{Ni} \le C/N$ for some $C > 0$ and for all $i, N$, and $\sum_{i=1}^{N} a_{Ni} = 1$ for all $N$.
Remarks.

(i) (5-2) and (c) merely guarantee that the “empirical” scale established by $k_N(\sum_{i=1}^{N} a_{Ni} U_i)$ does not shrink to 0 or stretch to $\infty$ as $N$ varies. For example, if $k_N(1) - k_N(0) \to 0$ as $N \to \infty$, then $k_N(\sum_{i=1}^{N} a_{Ni} U_i)$ for large $N$ is uninteresting.

(ii) The condition $a_{Ni} \le C/N$ guarantees that no single item dominates the score; i.e., the scoring is “balanced”.

(iii) A remark on notation is appropriate. An arbitrary scoring method $h_N(\underline{U}_N)$ assigns a score to each test response $\underline{U}_N$ and hence $h_N(\cdot)$ is a function with an $N$-dimensional domain (such a score occurs in (4-11)). By contrast, an equicontinuous balanced scoring method $k_N(\sum_{i=1}^{N} a_{Ni} U_i)$ assigns a score to each linear combination $\sum_{i=1}^{N} a_{Ni} U_i$ for each $N$ and hence $k_N(\cdot)$ is a function with a unidimensional domain.
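Condition (c) of Definition 5.1 is easy to check mechanically for a given set of item weights. The sketch below is a hypothetical helper, not part of the paper; it reads the condition as $0 \le a_{Ni} \le C/N$ with weights summing to one, and compares equal weighting with an assumed scheme that puts half of the total weight on the first item.

```python
from fractions import Fraction


def is_balanced(weights, c):
    """Check condition (c): weights sum to 1 and each lies in [0, c / N]."""
    n = len(weights)
    bound = Fraction(c) / n
    return sum(weights) == 1 and all(0 <= a <= bound for a in weights)


def equal_weights(n):
    """Proportion-correct weighting: a_Ni = 1/N for every item."""
    return [Fraction(1, n)] * n


def dominated_weights(n):
    """Hypothetical scheme: half the weight on item 1, the rest spread evenly."""
    return [Fraction(1, 2)] + [Fraction(1, 2 * (n - 1))] * (n - 1)


print(is_balanced(equal_weights(40), c=1))
print(is_balanced(dominated_weights(4), c=2))
print(is_balanced(dominated_weights(40), c=2))
```

For any fixed constant the dominated scheme eventually violates the bound as the test grows, which is the sense in which a single item is not allowed to dominate a balanced score.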


A fundamental result of “long” test theory is that if a test $\{U_i, i \ge 1\}$ is essentially unidimensional, consistent estimation of $\theta$ is possible in the sense that for any equicontinuous balanced scoring method, given $\Theta_g = \theta$,

$$k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{ig}\Big) - k_N\Big(\sum_{i=1}^{N} a_{Ni} T_i(\theta)\Big) \to 0 \qquad \text{(5-3)}$$

in probability as $N \to \infty$, for $g = 1, 2$ (established by a minor modification of the proof of Theorem 3.2 in Stout (1989)). That is, $\theta$ is estimated with total accuracy in the limit, using the latent scale

$$k_N\Big(\sum_{i=1}^{N} a_{Ni} T_i(\theta)\Big).$$

Here $T_i(\theta)$ denotes the marginal item response function defined by $T_i(\theta) = E[P_i(\underline{\Theta}) \mid \Theta = \theta]$. Expectation is over both groups here; that is, $\Theta$ is the target ability of a randomly chosen examinee from the pooled group resulting from combining the two groups. An important special case is that when $d_E = 1$, given $\Theta_g = \theta$,

$$\frac{1}{N}\sum_{i=1}^{N} U_{ig} - \frac{1}{N}\sum_{i=1}^{N} T_i(\theta) \to 0$$

in probability as $N \to \infty$, for $g = 1, 2$.

Armed with the above concepts, a “long-test” definition of test bias is now given. The intuitive idea is that if the test scoring method being used measures target ability equally well in both groups as measured by the convergence in probability behavior as $N \to \infty$, then no test bias exists. Let

$$\underline{U}_g = (U_{1g}, \ldots, U_{Ng}, U_{N+1,g}, \ldots)$$

denote the infinite item pool for Group $g$ and let

$$\underline{U}_{Ng} = (U_{1g}, \ldots, U_{Ng})$$

denote the finite observed segment of the item pool for $g$. To study long-test test bias, we make the assumption that $\underline{U}_g$ has a two group representation of the form (4-4) with (3-4), Assumptions 3.1 and 3.2 holding within each group and with Assumption 4.1 holding. It then follows from the ordinary weak law of large numbers in probability theory for any equicontinuous balanced test scoring method that, given $\underline{\Theta}_1 = \underline{\theta}$ and $\underline{\Theta}_2 = \underline{\theta}$,

$$k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i1}\Big) - k_N\Big(\sum_{i=1}^{N} a_{Ni} P_i(\underline{\theta})\Big) \to 0
\quad \text{and} \quad
k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i2}\Big) - k_N\Big(\sum_{i=1}^{N} a_{Ni} P_i(\underline{\theta})\Big) \to 0 \qquad \text{(5-4)}$$

in probability as $N \to \infty$. Here $\underline{\theta} = (\theta, \eta)$ where $\theta$ is the target ability and $\eta$ is the nuisance determinant. Of course, in order to be able to assume local independence for the representation (4-4) and have good model fit the dimension $d$ of $\eta$ may need to be quite large. It is easy to show (5-4) also holds for a $d_E$ essential dimensional representation of the form (4-4), with $d_E$ possibly much smaller than $d$.
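One way to see why the weak law applies in (5-4): under local independence given $(\theta, \eta)$, the conditional variance of a balanced linear score is $\sum_i a_{Ni}^2 P_i(1 - P_i)$, which is at most $C/(4N)$ because $a_{Ni} \le C/N$ and the weights sum to one. The sketch below evaluates this variance exactly for proportion-correct scoring; the common item probability 2/3 is an assumed value chosen only for illustration.

```python
from fractions import Fraction


def score_variance(weights, probs):
    """Var(sum_i a_i U_i | theta, eta) when items are locally independent."""
    return sum(a * a * p * (1 - p) for a, p in zip(weights, probs))


def proportion_correct_variance(n, p=Fraction(2, 3)):
    """Variance of sum(U_i)/N when every item has success probability p."""
    weights = [Fraction(1, n)] * n
    return score_variance(weights, [p] * n)


for n in (10, 100, 1000):
    # With a_Ni = 1/N the constant C equals 1, so the bound is 1/(4N).
    print(n, proportion_correct_variance(n), Fraction(1, 4 * n))
```

Since the variance vanishes as the test lengthens, Chebyshev's inequality gives convergence in probability of the linear score, and equicontinuity of the $k_N$ carries that convergence over to the transformed score.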

Because $k_N(\sum_{i=1}^{N} a_{Ni} U_{i1})$ and $k_N(\sum_{i=1}^{N} a_{Ni} U_{i2})$ have the same limit behavior in probability (hence $(\theta, \eta)$ is measured equally well in both groups), (5-4) seems to suggest that no test bias in a long-test sense is possible. However, (5-4) is not the same as group-equivalent measurement of target ability $\theta$ alone. As in the finite test length case of Section 4, the source of bias is that the conditional distributions of $(\underline{\eta}_1 \mid \Theta_1 = \theta)$ and $(\underline{\eta}_2 \mid \Theta_2 = \theta)$ differ, thereby leading to superior limiting test scores for one group versus another given $\Theta_1 = \theta$, $\Theta_2 = \theta$. An example should clarify this claim.


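Example 5.1. Consider examinee subpopulations from the two groups defined by $\Theta_1 = \theta$ and $\Theta_2 = \theta$, respectively; i.e., both subpopulations have the same target ability. Suppose that there is a single nuisance determinant and that

$$\begin{aligned}
P[\underline{\eta}_1 = 1 \mid \Theta_1 = \theta] &= \tfrac{1}{4}, & P[\underline{\eta}_2 = 1 \mid \Theta_2 = \theta] &= \tfrac{3}{4}, \\
P[\underline{\eta}_1 = 0 \mid \Theta_1 = \theta] &= \tfrac{3}{4}, & P[\underline{\eta}_2 = 0 \mid \Theta_2 = \theta] &= \tfrac{1}{4}.
\end{aligned} \qquad \text{(5-5)}$$

Clearly this is a case of potential for bias against Group 1 at $\theta$. Suppose $k_N(x) = x$ for all $N$ and $a_{Ni} = 1/N$ for all $i$ and $N$:

$$k_N\Big(\sum_{i=1}^{N} a_{Ni} U_i\Big) = \sum_{i=1}^{N} U_i/N.$$

Suppose local independence with respect to $(\Theta, \underline{\eta})$ with

$$P_i(\theta, 1) = \tfrac{2}{3}, \qquad P_i(\theta, 0) = \tfrac{1}{3}$$

for all $i$. Then, (5-4) specializes to

$$\frac{\sum_{i=1}^{N} U_{i1}}{N} \to \frac{2}{3}, \qquad \frac{\sum_{i=1}^{N} U_{i2}}{N} \to \frac{2}{3}$$

given $\Theta_1 = \theta$, $\underline{\eta}_1 = 1$ and $\Theta_2 = \theta$, $\underline{\eta}_2 = 1$, respectively, in probability as $N \to \infty$. Also

$$\frac{\sum_{i=1}^{N} U_{i1}}{N} \to \frac{1}{3}, \qquad \frac{\sum_{i=1}^{N} U_{i2}}{N} \to \frac{1}{3}$$

given $\Theta_1 = \theta$, $\underline{\eta}_1 = 0$ and $\Theta_2 = \theta$, $\underline{\eta}_2 = 0$, respectively, in probability as $N \to \infty$. But, conditioning on $\Theta_1 = \theta$ and $\Theta_2 = \theta$, it follows using (5-5) that

$$\frac{\sum_{i=1}^{N} U_{i1}}{N} \to \tfrac{2}{3} \text{ with probability } \tfrac{1}{4} \quad \text{and} \quad \frac{\sum_{i=1}^{N} U_{i1}}{N} \to \tfrac{1}{3} \text{ with probability } \tfrac{3}{4} \qquad \text{(5-6)}$$

as $N \to \infty$, as contrasted with

$$\frac{\sum_{i=1}^{N} U_{i2}}{N} \to \tfrac{2}{3} \text{ with probability } \tfrac{3}{4} \quad \text{and} \quad \frac{\sum_{i=1}^{N} U_{i2}}{N} \to \tfrac{1}{3} \text{ with probability } \tfrac{1}{4} \qquad \text{(5-7)}$$

as $N \to \infty$. Clearly Group 2 is favored among examinees of target ability $\theta$. It may be interesting to note that

$$E\Big[\frac{\sum_{i=1}^{N} U_{i1}}{N} \,\Big|\, \Theta_1 = \theta\Big] = \frac{3}{4} \cdot \frac{1}{3} + \frac{1}{4} \cdot \frac{2}{3} = \frac{5}{12} \qquad \text{(5-8)}$$

for all $N$, while

$$E\Big[\frac{\sum_{i=1}^{N} U_{i2}}{N} \,\Big|\, \Theta_2 = \theta\Big] = \frac{1}{4} \cdot \frac{1}{3} + \frac{3}{4} \cdot \frac{2}{3} = \frac{7}{12}. \qquad \text{(5-9)}$$

Thus, in a trivial manner not dependent on $N$,

$$\lim_{N \to \infty} \Big( E\Big[\frac{\sum_{i=1}^{N} U_{i1}}{N} \,\Big|\, \Theta_1 = \theta\Big] - E\Big[\frac{\sum_{i=1}^{N} U_{i2}}{N} \,\Big|\, \Theta_2 = \theta\Big] \Big) = -\frac{1}{6} < 0. \qquad \text{(5-10)}$$

□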
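The arithmetic of Example 5.1 can be reproduced exactly. The sketch below assumes the values that the surviving computations in (5-8) and (5-9) imply: nuisance probabilities 1/4 and 3/4 for the value 1 in Groups 1 and 2, and item response probabilities 2/3 and 1/3 at nuisance values 1 and 0. The dictionary and function names are hypothetical.

```python
from fractions import Fraction

# P_i(theta, eta) for every item, as read from the example.
IRF = {1: Fraction(2, 3), 0: Fraction(1, 3)}

# Conditional distribution of the nuisance determinant given theta, by group.
ETA_GIVEN_THETA = {
    1: {1: Fraction(1, 4), 0: Fraction(3, 4)},
    2: {1: Fraction(3, 4), 0: Fraction(1, 4)},
}


def limiting_score_distribution(group):
    """Map each limiting value of sum(U_i)/N to its probability, as in (5-6), (5-7)."""
    return {IRF[eta]: prob for eta, prob in ETA_GIVEN_THETA[group].items()}


def expected_score(group):
    """E[sum(U_i)/N | Theta_g = theta]; it does not depend on N in this example."""
    dist = limiting_score_distribution(group)
    return sum(value * prob for value, prob in dist.items())


bias = expected_score(1) - expected_score(2)
print(expected_score(1), expected_score(2), bias)
```

The negative difference is the quantity in (5-10), and its sign is what marks the bias as being against Group 1.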


We  will  use  the  idea  embodied  in  (5-10)  to  define  large  sample  test  bias. 
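Definition 5.2. Let $\theta$ denote target ability.

(i) There is no long-test test bias at $\theta$ with respect to an equicontinuous balanced scoring method $\{k_N(\sum_{i=1}^{N} a_{Ni} U_i)\}$ provided

$$E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i1}\Big) \,\Big|\, \Theta_1 = \theta\Big] - E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i2}\Big) \,\Big|\, \Theta_2 = \theta\Big] \to 0 \qquad \text{(5-11)}$$

as $N \to \infty$.

(ii) If for every $\theta$, there is no long-test test bias at $\theta$, then there is no long-test test bias.

(iii) If at $\theta$

$$E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i1}\Big) \,\Big|\, \Theta_1 = \theta\Big] - E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i2}\Big) \,\Big|\, \Theta_2 = \theta\Big] < C < 0 \qquad \text{(5-12)}$$

for all sufficiently large $N$ and some $C$, then long-test test bias exists at $\theta$ against Group 1. □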


We first show that if there is no long-test test bias in the empirical sense that among examinees with the same target ability $\theta$ neither group is favored in their stochastic test score behavior as $N \to \infty$, then there is no long-test test bias in the sense of Definition 5.2.

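Theorem 5.1. Suppose, given $\Theta_1 = \theta$ and $\Theta_2 = \theta$, that for an equicontinuous balanced scoring method,

$$k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i1}\Big) - c_{N1}(\theta) \to 0
\quad \text{and} \quad
k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i2}\Big) - c_{N2}(\theta) \to 0 \qquad \text{(5-13)}$$

in probability for some $c_{N1}(\theta), c_{N2}(\theta)$, as $N \to \infty$. Then (5-11) holds; that is, there is no long-test test bias at $\theta$ for the given scoring method.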


Remark. Note that it is not required that the centering functions $c_{Ng}(\theta)$ be the same for $g = 1, 2$. What is required is the existence of a centering function dependent on $\theta$ alone and not on $\eta$ for each $g$, as contrasted with (5-4). Of course, the case where the centering functions are the same is of special interest and is the main motivation for the theorem, as the remark immediately prior to the statement of the theorem indicates.
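Proof. By (5-2), $|k_N(x)| < C$ for some $C > 0$. Thus $|c_{Ng}(\theta)| < C + D$ for some $D > 0$. By the Lebesgue dominated convergence theorem (see p. 11, Serfling, 1980), using (5-13),

$$E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{ig}\Big) - c_{Ng}(\theta) \,\Big|\, \Theta_g = \theta\Big] \to 0 \qquad \text{(5-14)}$$

as $N \to \infty$, for $g = 1, 2$. Now, trivially, the conclusion (5-13) holds given $\Theta_1 = \theta$, $\underline{\eta}_1 = \eta$ and $\Theta_2 = \theta$, $\underline{\eta}_2 = \eta$ for all $\eta$. Thus, subtracting the two results in (5-13),

$$c_{N1}(\theta) - c_{N2}(\theta) \to 0$$

as $N \to \infty$. Let $c_N(\theta) \equiv c_{N1}(\theta)$. It then follows from (5-14) that

$$E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{ig}\Big) - c_N(\theta) \,\Big|\, \Theta_g = \theta\Big] \to 0$$

as $N \to \infty$ for $g = 1, 2$. Subtracting these two limits yields

$$E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i1}\Big) \,\Big|\, \Theta_1 = \theta\Big] - E\Big[k_N\Big(\sum_{i=1}^{N} a_{Ni} U_{i2}\Big) \,\Big|\, \Theta_2 = \theta\Big] \to 0$$

as $N \to \infty$, i.e., no long-test test bias exists at $\theta$. □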


Remark. We claim that (5-13) (and hence the similar condition (5-3)) is inappropriately strong to use as a definition of lack of long-test bias. To see this, modify Example 5.1 by assuming

$$P[\underline{\eta}_g = 0 \mid \Theta_g = \theta] = \tfrac{1}{2}, \qquad P[\underline{\eta}_g = 1 \mid \Theta_g = \theta] = \tfrac{1}{2},$$

for $g = 1, 2$. Hence no potential for bias exists. However note that, given $\Theta_1 = \theta$ and $\Theta_2 = \theta$,

$$\frac{\sum_{i=1}^{N} U_{ig}}{N} - \frac{2}{3} \to 0 \text{ with probability } \tfrac{1}{2}$$

and

$$\frac{\sum_{i=1}^{N} U_{ig}}{N} - \frac{1}{3} \to 0 \text{ with probability } \tfrac{1}{2}$$

for both $g = 1$ and $g = 2$. Thus (5-13) is precluded, and long-test bias would be said to exist (even though no potential for bias exists) if (5-13) were made the basis for deciding on the existence of long-test test bias. Note that the above convergence in probability behavior is identical for both groups. Intuitively, in this example the estimation of $\theta$ by $\sum_{i=1}^{N} U_{ig}/N$ as $N \to \infty$ is equally bad for both groups in the sense that convergence in probability at $\theta$ fails to occur in exactly the same manner in both groups. Thus one would not wish to claim that test bias is occurring.

The following theorem states that essential unidimensionality is a sufficient condition for ensuring that no long-test test bias exists.

Theorem 5.2. Suppose $d_E = 1$ for target ability $\theta$ in the combined population consisting of Group 1 and Group 2 examinees. Then, with respect to all equicontinuous balanced scoring methods, no long-test test bias exists.

Proof. Let $\{k_N(\sum_{i=1}^{N} a_{Ni} U_i)\}$ be an arbitrary equicontinuous balanced scoring method. We need to prove (5-11) for every $\theta$. Fix $\theta$. By work of Stout (1989), $d_E = 1$ implies (5-3) for $g = 1, 2$; i.e., (5-13) holds with $c_{Ng}(\theta) = k_N(\sum_{i=1}^{N} a_{Ni} T_i(\theta))$. Thus, by Theorem 5.1, the desired result holds. □

By contrast, if the potential for bias exists at $\theta$, then it follows that there exist balanced scoring methods for which long-test test bias at $\theta$ does exist.

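Theorem 5.3. Assume that IRFs are differentiable in $\eta$. Let $\theta$ denote target ability, $\eta$ denote the nuisance determinant and assume potential for bias against Group 1 at $\theta$. Assume there exists a balanced scoring method $\{a_{Ni}\}$ (i.e., $k_N(x) = x$ in Definition 5.1) such that at $\theta$

$$\frac{d}{d\eta} \sum_{i=1}^{N} a_{Ni} P_i(\theta, \eta) > \epsilon_\theta > 0 \qquad \text{(5-15)}$$

for all $\eta$ and all $N$. Then long-test test bias exists at $\theta$ against Group 1.

Proof. For $\underline{\theta} = (\theta, \eta)$, (5-4) holds given $\Theta_1 = \theta$, $\underline{\eta}_1 = \eta$; $\Theta_2 = \theta$, $\underline{\eta}_2 = \eta$. Now, letting $F_g(\eta \mid \theta)$ denote the cdf of $\underline{\eta}_g \mid \Theta_g = \theta$ and using (5-15) and integration by parts,

$$\begin{aligned}
E\Big[\sum_{i=1}^{N} a_{Ni} U_{i1} \,\Big|\, \Theta_1 = \theta\Big] &- E\Big[\sum_{i=1}^{N} a_{Ni} U_{i2} \,\Big|\, \Theta_2 = \theta\Big] \\
&= \int_{-\infty}^{\infty} \Big\{\sum_{i=1}^{N} a_{Ni} P_i(\theta, \eta)\Big\} \, d\{F_1(\eta \mid \theta) - F_2(\eta \mid \theta)\} \\
&= -\int_{-\infty}^{\infty} \frac{d}{d\eta}\Big\{\sum_{i=1}^{N} a_{Ni} P_i(\theta, \eta)\Big\} \, (F_1(\eta \mid \theta) - F_2(\eta \mid \theta)) \, d\eta \\
&\le -\epsilon_\theta \int_{-\infty}^{\infty} \{F_1(\eta \mid \theta) - F_2(\eta \mid \theta)\} \, d\eta \\
&\le -c(\theta),
\end{aligned}$$

where $c(\theta) > 0$ by the assumption of potential for bias against Group 1. Since this holds for all $N$, the result is proved by Definition 5.2. □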
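The integration-by-parts step in the proof of Theorem 5.3 can be sanity-checked numerically. The sketch below is an illustration under assumptions that are not in the paper: the nuisance determinant is uniform on $(0, 1)$ in Group 1 and on $(0.5, 1.5)$ in Group 2 at the fixed $\theta$ (so Group 1 is stochastically smaller), and the weighted average IRF is a logistic function of $\eta$. It compares the directly computed difference of expected scores with the integral obtained by parts.

```python
import math

# Hypothetical supports of eta_g | Theta_g = theta (uniform distributions).
G1 = (0.0, 1.0)
G2 = (0.5, 1.5)


def sigmoid(x):
    """Assumed weighted average IRF, sum_i a_Ni P_i(theta, eta), as a function of eta."""
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    """Antiderivative of the sigmoid."""
    return math.log1p(math.exp(x))


def uniform_cdf(x, lo, hi):
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


def expected_score(lo, hi):
    """E[sigmoid(eta)] for eta uniform on (lo, hi), in closed form."""
    return (softplus(hi) - softplus(lo)) / (hi - lo)


def bias_by_parts(n_steps=20000):
    """-integral of d/d(eta)[score] * (F_1 - F_2), by the midpoint rule."""
    lo, hi = min(G1[0], G2[0]), max(G1[1], G2[1])  # F_1 - F_2 vanishes outside
    step = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps):
        eta = lo + (k + 0.5) * step
        slope = sigmoid(eta) * (1.0 - sigmoid(eta))  # derivative of the sigmoid
        total += slope * (uniform_cdf(eta, *G1) - uniform_cdf(eta, *G2)) * step
    return -total


direct = expected_score(*G1) - expected_score(*G2)
by_parts = bias_by_parts()
print(direct, by_parts)
```

The two numbers agree to within quadrature error and are negative, matching the sign that Definition 5.2 associates with bias against Group 1.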

How is the finite test length definition of test bias (Definition 4.6) related to the long-test test bias definition (Definition 5.2)? The answer is that lack of finite length test bias for all finite length tests $\underline{U}_N$ from the item pool $\{U_i, i \ge 1\}$ implies lack of long-test test bias for all equicontinuous balanced test scores.

Theorem 5.4. Assume an IRT representation for $\{U_i, i \ge 1\}$ of the form (4-4) for $\underline{\theta} = (\theta, \eta)$. Let $\{k_N(\sum_{i=1}^{N} a_{Ni} U_i)\}$ be an equicontinuous balanced scoring method. Assume no finite length test bias exists; that is, (4-13) holds for all $N$. Assume regularity Assumption 4.2. Then there is no long-test test bias; that is, (5-11) holds.

Proof. Trivial from examination of (4-13) and (5-11). □

Remark. Of course, the absence of long-test test bias is a less restrictive condition than the absence of finite length test bias. Nonetheless the long-test definition seems an appropriate way to describe biasedness of a test when the test is long.

From the long test perspective, a long-test definition of a valid subtest is needed. Previously, in the short test case, our definition of a valid subtest $S$ with response $\underline{U}'$ was stated to be equivalent to (4-22) holding for all $(\theta, \eta)$. Just as the short-test version of no test bias ((4-13)) is modified for the long-test version of no test bias ((5-11)), a similar modification of (4-22) yields an appropriate definition of a valid subtest. We consider only equicontinuous balanced scoring methods for subtests $\underline{U}'_N$ of $\underline{U}_N$. That is, we consider scoring methods $k'_N(\sum' a_{Ni} U_i)$ for which Definition 5.1 holds, where $\sum'$ denotes summation over the indices of the components of $\underline{U}'_N$.

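Definition 5.3. Let the item pool $\{U_i, i \ge 1\}$ have IRT representation (3-7) with the usual accompanying assumptions. Let $\underline{U}'_N \subset \underline{U}_N$ denote a subtest of $\underline{U}_N$ for each $N$. Denoting the cardinality of a set $A$ as $\mathrm{card}(A)$, assume

$$\underline{U}'_N \subset \underline{U}'_{N+1}, \qquad \frac{\mathrm{card}(\underline{U}'_N)}{N} > C > 0 \qquad \text{(5-16)}$$

for some $C$ and for all $N \ge N_0$ for some fixed $N_0$ ($N_0$ will be small in all applications). Then $\{\underline{U}'_N, N \ge 1\}$ is said to be a collection of valid subtests with respect to a specified equicontinuous balanced scoring method $\{k'_N(\sum' a_{Ni} U_i)\}$ provided there exists a function $c_N(\theta)$ such that for all $(\theta, \eta)$

$$E\Big[k'_N\Big(\sum{}' a_{Ni} U_i\Big) \,\Big|\, (\Theta, \underline{\eta}) = (\theta, \eta)\Big] - c_N(\theta) \to 0 \qquad \text{(5-17)}$$

as $N \to \infty$.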


Remark. Recall that short-test subtest validity, i.e., (4-22) holding for all $(\theta, \eta)$, for scoring method $k'_N(\sum' a_{Ni} U_i)$ say, simply means that for all $\theta$

$$m(\theta, \eta) \equiv E\Big[k'_N\Big(\sum{}' a_{Ni} U_i\Big) \,\Big|\, (\Theta, \underline{\eta}) = (\theta, \eta)\Big]$$

depends only on $\theta$ and not on $\eta$. By contrast the long-test subtest validity just defined by (5-17) weakens this to asserting that $m(\theta, \eta)$ for all $\theta$ is asymptotically not dependent on $\eta$ as $N \to \infty$. That is, intuitively, for large fixed $N$, $m(\theta, \eta)$ for all $\theta$ is approximately constant as $\eta$ varies.

As  with  long-test  test  bias,  the  theory  of  essential  unidimensionality  is  useful  in  studying  long- 
test  subtest  validity: 

Theorem 5.5. Assume $d_E = 1$ with latent ability $\theta$ being target ability for subtests $\{\underline{U}'_N, N \ge 1\}$ satisfying (5-16). Then (5-17) holds for all equicontinuous balanced scoring methods; i.e., subtest validity holds for all equicontinuous balanced scoring methods.

Proof. It follows from a minor modification of the proof of Theorem 3.2 in Stout (1989) that for all $(\theta, \eta)$

$$k'_N\Big(\sum{}' a_{Ni} U_i\Big) - c_N(\theta) \to 0 \qquad \text{(5-18)}$$

in probability as $N \to \infty$. But $|k'_N(\sum' a_{Ni} U_i)| < C$ for some constant $C < \infty$. It is a standard result from the theory of convergence in probability that convergence in probability and the boundedness just stated together imply convergence in expectation. That is, for all $(\theta, \eta)$,

$$E\Big[k'_N\Big(\sum{}' a_{Ni} U_i\Big) \,\Big|\, (\Theta, \underline{\eta}) = (\theta, \eta)\Big] - c_N(\theta) \to 0$$

as $N \to \infty$; i.e., (5-17) holds. □

Stout  (1987)  has  developed  a  statistical  test  for  essential  unidimensionality.  Clearly  this  could 
be  applied  to  a  subtest  to  assess  whether  it  can  be  used  as  a  valid  subtest  in  the  case  of  a  “long” 
test. 

6  Test  Bias  as  a  Function  of  Target  Ability 

Sections 4 and 5 focus on test bias for fixed values of target ability $\theta$. In these sections it was argued that test bias (item bias also) is a phenomenon that expresses itself at each $\theta$. In particular, it is the comparison of the distributions of $(\underline{\eta}_1 \mid \Theta_1 = \theta)$ and $(\underline{\eta}_2 \mid \Theta_2 = \theta)$ that dictates whether test bias is possible at $\theta$ and, if such bias is possible, in which direction (biased in favor of or biased against Group 1) it occurs. Mathematically, without further assumptions, one cannot infer what the character of the bias at $\theta' \ne \theta$ is from the character of the bias at $\theta$. This section develops the concept of considering test bias aggregated over target ability. We return to the convention of suppressing $N$ in the notation when appropriate; e.g., $\underline{U} = \underline{U}_N$.

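Definition 6.1. Let $h(\underline{U})$ be a test scoring method and $\underline{U}$ be a test response as in (3-1). The expected test bias at $\theta$ against Group 1 using test scoring method $h(\underline{U})$ is given by

$$B(\theta) = E[h(\underline{U}_2) \mid \Theta_2 = \theta] - E[h(\underline{U}_1) \mid \Theta_1 = \theta]. \qquad \text{(6-1)}$$

□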


Remarks. 

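(i) Note that $B(\theta) > 0$ indicates test bias against Group 1 at $\theta$.

(ii) Several special cases are of interest. If $h(\underline{u}) = \sum_{i=1}^{N} u_i/N$, then $B(\theta)$ is the difference of (marginal) test characteristic curves (average of marginal IRFs):

$$B(\theta) = \frac{1}{N}\sum_{i=1}^{N} T_{i2}(\theta) - \frac{1}{N}\sum_{i=1}^{N} T_{i1}(\theta). \qquad \text{(6-2)}$$

If $h(\underline{u}) = u_i$, then

$$B(\theta) = T_{i2}(\theta) - T_{i1}(\theta),$$

the amount of item $i$ bias against Group 1 at $\theta$.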
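As a concrete illustration of (6-1) and (6-2), the sketch below evaluates the expected test bias $B(\theta)$ for proportion-correct scoring over a grid of target abilities. The model is assumed purely for illustration and is not taken from the paper: three logistic items with hypothetical difficulties, a binary nuisance determinant that adds one logit when present, and group-specific nuisance probabilities (0.25 and 0.75) that do not vary with $\theta$.

```python
import math

ITEM_DIFFICULTIES = [-1.0, 0.0, 1.0]  # hypothetical item parameters
P_ETA_ONE = {1: 0.25, 2: 0.75}        # assumed P[eta_g = 1 | Theta_g = theta]


def irf(theta, eta, difficulty):
    """Assumed item response function P_i(theta, eta)."""
    return 1.0 / (1.0 + math.exp(-(theta + eta - difficulty)))


def marginal_irf(theta, difficulty, group):
    """T_ig(theta): the IRF averaged over the group's nuisance distribution."""
    p1 = P_ETA_ONE[group]
    return p1 * irf(theta, 1, difficulty) + (1.0 - p1) * irf(theta, 0, difficulty)


def expected_test_bias(theta):
    """B(theta) of (6-2): Group 2 minus Group 1 marginal test characteristic curve."""
    n = len(ITEM_DIFFICULTIES)
    tcc = {
        g: sum(marginal_irf(theta, b, g) for b in ITEM_DIFFICULTIES) / n
        for g in (1, 2)
    }
    return tcc[2] - tcc[1]


grid = [x / 10.0 for x in range(-30, 31)]
bias_curve = [expected_test_bias(t) for t in grid]
print(min(bias_curve), max(bias_curve))
```

In this toy model $B(\theta)$ is positive at every grid point, so the bias runs against Group 1 across the whole ability range, though its size varies with $\theta$.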

Probably the most common pattern in the potential for bias as a function of $\theta$ is unidirectional potential for bias:

Definition 6.2. If potential for bias exists against the same group at every $\theta$ then unidirectional potential for bias is said to exist against the group. □

Another less common, but still important pattern in the potential for bias as a function of $\theta$ is that the “direction” of the potential for bias changes from one end of the $\theta$-continuum to the other:

Definition 6.3. Suppose for some fixed $\theta_0$ that the potential for bias against one group exists for all $\theta < \theta_0$ and the potential for bias exists against the other group for all $\theta > \theta_0$. Then bidirectional potential for bias is said to exist. □

The  verbal  analogies  example  of  Section  2  is  an  obvious  practical  example  of  unidirectional 
potential  for  bias.  For,  it  seems  likely  that  the  potential  for  test  bias  against  German  immigrants 
will  hold  regardless  of  the  level  of  verbal  analogies  ability  being  conditioned  on. 
As an example of bidirectional potential for bias, suppose $\Theta_1$ and $\Theta_2$ are both uniformly distributed on the interval $[-1, 1]$. Suppose that in Group 1, $\Theta_1$ and $\underline{\eta}_1$ are statistically independent with $\underline{\eta}_1$ uniformly distributed on $[-1, 1]$. Suppose in Group 2 that $(\underline{\eta}_2 \mid \Theta_2 = \theta)$ has a uniform distribution on the interval with end points 0 and $2\theta$. That is, perhaps because of cultural differences, in Group 2 it follows that $\Theta$ and $\underline{\eta}$ are highly positively correlated while $\Theta$ and $\underline{\eta}$ are uncorrelated in Group 1. Elementary computation shows that if $-1 < \theta < 0$, (4-10) holds, yet if $0 < \theta < 1$, (4-9) holds. That is, potential for bias against Group 2 holds for $\theta < 0$ and potential for bias against Group 1 holds if $\theta > 0$; i.e., bidirectional potential for bias holds.

Test bias (and item bias) can be unidirectional or bidirectional.

Definition 6.4. If test bias (either in the ordering sense of Definition 4.6 or in the long-test sense of Definition 5.2) exists against the same group at every $\theta$, then unidirectional test bias against that group is said to hold.

Definition 6.5. If for some $\theta_0$ test bias in the sense of Definition 4.6 holds against one group for all $\theta < \theta_0$ and against the other group for all $\theta > \theta_0$ then bidirectional test bias is said to occur. □

A  long-test  version  of  Definition  6.5  is  easy  to  give  but  is  omitted  for  simplicity.  The  following 
results  relate  unidirectional  potential  for  bias  to  unidirectional  test  bias. 

Theorem 6.1. Suppose test bias exists against Group 1 at some $\theta$ in the sense of Definition 4.6, and suppose unidirectional potential for bias. Assume a test scoring method of the form (4-11). Suppose for every $\theta'$ that there is some $i$ (possibly dependent on $\theta'$) for which $h(\underline{u})$ is strictly increasing as $u_i = 0$ increases to $u_i = 1$ and for which $P_i(\theta', \eta)$ is strictly increasing in $\eta$. Then unidirectional test bias against Group 1 holds.

Proof. By Theorem 4.3, the potential for bias against Group 1 at $\theta$ holds. By assumption of unidirectional potential for bias, the potential for bias against Group 1 thus holds for all $\theta'$. Apply Theorem 4.2 together with the remark (i) following it. □

Theorem 6.2. Assume IRFs are differentiable in $\eta$. Suppose long-test test bias exists against
Group 1 at some $\theta$ in the sense of Definition 5.2 for a balanced scoring method $\{a_{N,i}\}$ and suppose
unidirectional potential for test bias. Assume for every $\theta'$

$$\frac{d}{d\eta} \sum_{i=1}^{N} a_{N,i} P_i(\theta', \eta) > \epsilon_{\theta'} > 0$$

for all $\eta$ (without loss of generality assumed unidimensional here). Then unidirectional (long-test)
test bias against Group 1 holds in the sense of Definition 6.4.

Proof.  Same  as  that  of  Theorem  6.1  except  Theorem  5.3  is  used  in  place  of  Theorem  4.2.  □ 

In order to study bidirectional test bias, attention is restricted to balanced scoring methods.
For an arbitrary balanced scoring method $\sum_{i=1}^{N} a_{N,i} U_i$, letting

$$F_g(\eta \mid \theta) = P[\eta_g \le \eta \mid \Theta_g = \theta]$$

and assuming differentiability of IRFs and a unidimensional nuisance determinant, the following
formula for $B(\theta)$ of (6-1), obtained by integration by parts, is useful:

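$$B(\theta) = \int_{-\infty}^{\infty} \sum_{i=1}^{N} a_{N,i} \, \frac{d}{d\eta} P_i(\theta, \eta) \, \left[ F_1(\eta \mid \theta) - F_2(\eta \mid \theta) \right] d\eta \qquad (6\text{-}3)$$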
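The integration-by-parts identity behind (6-3) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $B(\theta)$ to be the Group 2 minus Group 1 difference in expected score at fixed $\theta$, uses a single item with weight 1, a hypothetical logistic IRF, and normal nuisance distributions. None of these specific choices come from the paper. It compares the direct difference of expectations with the integral of $\frac{d}{d\eta} P(\theta, \eta) \, [F_1(\eta \mid \theta) - F_2(\eta \mid \theta)]$.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu, sd):
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

def irf(theta, eta, a=1.0, b=0.8):
    # Hypothetical IRF P(theta, eta); strictly increasing in eta because b > 0.
    return logistic(a * theta + b * eta)

def d_irf_d_eta(theta, eta, a=1.0, b=0.8):
    # Derivative of the logistic IRF with respect to eta.
    p = irf(theta, eta, a, b)
    return b * p * (1.0 - p)

def trapezoid(f, lo, hi, n=4000):
    # Composite trapezoid rule on [lo, hi] with n subintervals.
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h

def bias_direct(theta, mu1, mu2, sd=1.0):
    # Assumed convention: E[P | theta, Group 2] - E[P | theta, Group 1].
    e1 = trapezoid(lambda e: irf(theta, e) * normal_pdf(e, mu1, sd), -12.0, 12.0)
    e2 = trapezoid(lambda e: irf(theta, e) * normal_pdf(e, mu2, sd), -12.0, 12.0)
    return e2 - e1

def bias_by_parts(theta, mu1, mu2, sd=1.0):
    # Integral of dP/d(eta) * [F_1(eta | theta) - F_2(eta | theta)].
    return trapezoid(
        lambda e: d_irf_d_eta(theta, e)
        * (normal_cdf(e, mu1, sd) - normal_cdf(e, mu2, sd)),
        -12.0, 12.0,
    )

if __name__ == "__main__":
    # Group 1 has the stochastically smaller nuisance distribution here.
    print(bias_direct(0.3, -0.5, 0.5), bias_by_parts(0.3, -0.5, 0.5))
```

Under these assumptions the two quantities agree to numerical precision, and both are positive when Group 1 has the stochastically smaller nuisance distribution, which is the sign pattern the proof of Theorem 6.3 relies on.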

Theorem 6.3. Assume a balanced scoring method with differentiable IRFs. Assume a unidimensional
nuisance trait $\eta$. Assume for each $\theta$, there exists some $i$ (possibly varying with $\theta$) for
which

$$a_{N,i} > 0, \quad \frac{d}{d\eta} P_i(\theta, \eta) > 0 \ \text{ for all } \eta > 0. \qquad (6\text{-}4)$$

Then bidirectional potential for test bias holds if and only if bidirectional test bias holds.

Proof. By Assumption 4.2, for fixed $\theta$ either

$$F_1(\eta \mid \theta) - F_2(\eta \mid \theta) > 0 \ \text{ for all } \eta \qquad (6\text{-}5)$$

or

$$F_1(\eta \mid \theta) - F_2(\eta \mid \theta) < 0 \ \text{ for all } \eta. \qquad (6\text{-}6)$$

Thus, using (6-3), (6-4) and the strict monotonicity of every $P_i(\theta, \eta)$ in $\eta$, $B(\theta) > 0$ or $B(\theta) < 0$
accordingly as (6-5) or (6-6) holds. Potential for bias at $\theta$ means that either (6-5) or (6-6) holds at
$\theta$. The desired result follows. □
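A small numerical illustration of Theorem 6.3 may help. The sketch below is hypothetical throughout: it assumes proportion-correct scoring (equal weights $1/N$), logistic IRFs with positive nuisance loadings, a Group 1 nuisance distribution that does not depend on $\theta$, and a Group 2 nuisance distribution whose mean equals $\theta$. It also assumes $B(\theta)$ is the Group 2 minus Group 1 difference in expected score. With these choices the stochastic ordering of the nuisance distributions reverses at $\theta_0 = 0$, and the sign of $B(\theta)$ reverses with it.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mu):
    # Density of a normal distribution with mean mu and unit variance.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def expected_score(theta, mu, loadings, n=2000, lo=-10.0, hi=10.0):
    # E[(1/N) * sum_i U_i | theta] when eta ~ N(mu, 1), by the trapezoid rule.
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        eta = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0
        score = sum(logistic(theta + b * eta) for b in loadings) / len(loadings)
        total += weight * score * normal_pdf(eta, mu)
    return total * h

def bias(theta, loadings):
    mu1 = 0.0    # Group 1: nuisance distribution unrelated to theta (assumption)
    mu2 = theta  # Group 2: nuisance mean tracks theta (assumption)
    return expected_score(theta, mu2, loadings) - expected_score(theta, mu1, loadings)

loadings = [0.2, 0.5, 0.8]  # hypothetical nuisance loadings, all positive

if __name__ == "__main__":
    for theta in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(theta, bias(theta, loadings))
```

In this sketch the difference is negative for $\theta < 0$ (bias against Group 2), zero at $\theta = 0$, and positive for $\theta > 0$ (bias against Group 1), so bidirectional potential for bias shows up as bidirectional test bias.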

Assume number correct scoring, which implies (6-2) and hence that test bias is controlled by
the (marginal) item response functions with respect to target ability. Graphically, bidirectional
test bias under this scoring method is shown in Figure 3. Note the effect is that the test displays
higher discrimination for Group 2 than for Group 1. That is, bidirectional test bias is expressed as
differing test discriminations for the two groups. By contrast, under (6-2), unidirectional test bias
is shown in Figure 4. Unidirectional test bias is not linked to differing test discriminations across
groups. Indeed the two test characteristic curves shown in Figure 4 can even be translates of one
another; e.g., for some $c > 0$, for given $T_{i2}(\theta) = T_i(\theta)$,

$$\sum_{i=1}^{N} T_{i1}(\theta)/N = \sum_{i=1}^{N} T_i(\theta + c)/N$$

for all $\theta$. That is, items could be uniformly more difficult for Group 2 examinees at every $\theta$.
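The contrast between translated and crossing test characteristic curves can be sketched numerically. The example below is a hypothetical illustration, not the curves of Figures 3 and 4: it assumes logistic marginal IRFs, three illustrative item difficulties, a shift $c = 0.5$, and slopes 1.0 and 1.5. It counts how often the Group 2 minus Group 1 difference in test characteristic curves changes sign over a grid of $\theta$ values.

```python
import math

def tcc(theta, difficulties, slope=1.0):
    # Hypothetical test characteristic curve: average of logistic marginal IRFs.
    return sum(
        1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in difficulties
    ) / len(difficulties)

def sign_changes(values, tol=1e-12):
    # Count sign changes along the sequence, ignoring values within tol of zero.
    signs = [1 if v > tol else -1 for v in values if abs(v) > tol]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

difficulties = [-1.0, 0.0, 1.0]           # illustrative item difficulties
c = 0.5                                   # illustrative shift
grid = [k / 4.0 for k in range(-12, 13)]  # theta from -3 to 3 in steps of 0.25

# Translated curves: Group 1 curve equals Group 2 curve evaluated at theta + c.
unidirectional = [tcc(t, difficulties) - tcc(t + c, difficulties) for t in grid]

# Crossing curves: same difficulties, Group 2 has the steeper (more discriminating) curve.
bidirectional = [
    tcc(t, difficulties, slope=1.5) - tcc(t, difficulties, slope=1.0) for t in grid
]

if __name__ == "__main__":
    print("translated curves, sign changes:", sign_changes(unidirectional))
    print("crossing curves, sign changes:", sign_changes(bidirectional))
```

With translated curves the difference keeps one sign at every $\theta$ (here Group 2 is uniformly disadvantaged), whereas curves that differ in discrimination cross once, so the direction of the difference depends on $\theta$.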

There is a debate about whether, from the cognitive perspective, differing discriminations across
groups are more the essence of bias than differing difficulties across groups. Also, some practitioners
claim that bidirectional test bias can be important in practice while others discount its importance.
It is hoped that Section 6 helps illuminate these issues.

7  Discussion  and  Summary  of  Results 

The  central  position  of  this  paper  is  that  bias  should  be  conceptualized,  studied,  and  measured  at 
the  test  level  rather  than  at  the  item  level.  A  multidimensional  but  non-parametric  IRT  model 
of test bias is presented and a number of important properties are derived. Our theory of test bias
includes  the  often  used  unidimensional  IRT  bias  approach  as  a  special  case. 

The  model  hypothesizes  a  target  ability  intended  to  be  measured  by  the  test  as  well  as  other 
dimensions called nuisance determinants, not intended to be measured. Informally, test bias occurs
when  the  test  under  consideration  is  measuring  nuisance  determinants  in  addition  to  the  target 
ability,  and  moreover  the  two  groups  do  not  possess  equal  amounts  of  the  nuisance  determinants. 
Our  view,  an  outgrowth  of  the  classical  predictive  validity  viewpoint  of  bias,  is  that  bias  is  really 
something  expressed  at  the  test  level  via  the  particular  test  score  in  use  and  that  bias  rests  in  the 
across-group  differences  in  the  relationship  between  test  scores  and  criterion.  For  us  the  “criterion” 
is  internal  to  the  test  and  is  expressed  by  a  “valid”  subtest  known  to  consist  of  items  measuring  only 
target  ability.  In  order  to  statistically  detect  test  bias,  a  valid  subtest  must  exist  and  be  identified. 

In  Section  3,  the  multidimensional  non-parametric  IRT  model  is  presented.  The  notion  of  the 
marginal IRF with respect to target ability is introduced.

[Figure 4: unidirectional test bias under number correct scoring]


In Section 4, test bias is carefully defined using the IRT model introduced in Section 3. Test
bias originates with the potential for test bias at a particular value $\theta$ of target ability existing
against a group in the sense of Definition 4.3. This potential for bias against Group 1 gets expressed
at the item level if any of the marginal group IRFs satisfy $T_{i1}(\theta) < T_{i2}(\theta)$. The potential for test
bias and a strictly increasing IRF in $\eta$ imply expressed item bias (Theorem 4.1).

The main focus of this paper is on biased items acting in concert. Three components combine to
produce test bias: (a) potential for bias, (b) dependence of the IRFs on $\eta$, and (c) the test scoring
method, which transmits simultaneous expressed item bias into test bias. Test bias is formally
defined in (4-12). It is shown that test bias at $\theta$ implies the potential for bias at $\theta$ (Theorem 4.3).
The central result of Section 4 (Theorem 4.2) shows that potential for bias at $\theta$ translates into test
bias at $\theta$ provided the scoring method depends on at least one item that has a strictly increasing
IRF in $\eta$ at $\theta$.

The  important  topic  of  item  bias  cancellation  is  taken  up  in  Section  4.4.  Example  4.1  illustrates 
how  cancellation  can  actually  decrease  the  amount  of  item  bias  that  gets  expressed  at  the  test  level. 
That  is,  the  potential  for  bias  need  not  be  strongly  transmitted  to  the  test  level  because  in  fact 
considerable  cancellation  can  occur  as  the  result  of  multidimensional  nuisance  determinants.  By 
contrast,  small  and  perhaps  undetectable  amounts  of  bias  at  the  item  level  can  be  translated  into 
a  substantial  amount  of  bias  expressed  at  the  test  level  when  no  cancellation  occurs.  Section  4.5 
formalizes the notion of a valid subtest, which must exist for test bias to be detected. Shealy and
Stout  (1990)  present  a  statistical  test  of  test  bias,  making  the  question  of  whether  test  bias  does 
exist  for  a  particular  data  set  an  answerable  one. 
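The contrast between cancellation and accumulation of item bias can be illustrated with a toy computation. Everything in the sketch below is hypothetical: 40 items, logistic IRFs with a small nuisance loading of 0.3, unit-variance normal nuisance distributions, and group mean differences of 0.2 on the nuisance dimensions. Item bias is taken as the Group 2 minus Group 1 difference in marginal IRFs at $\theta = 0$, and test-level bias as its sum over items (number correct scoring).

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_pdf(x, mu):
    # Density of a normal distribution with mean mu and unit variance.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def item_bias(theta, loading, mu1, mu2, n=2000, lo=-10.0, hi=10.0):
    # T_i2(theta) - T_i1(theta) for one item whose IRF is
    # logistic(theta + loading * eta), with eta ~ N(mu_g, 1) in group g.
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        eta = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * logistic(theta + loading * eta) * (
            normal_pdf(eta, mu2) - normal_pdf(eta, mu1)
        )
    return total * h

theta = 0.0
n_items = 40

# Item that depends on a nuisance dimension on which Group 1 is lower.
per_item = item_bias(theta, 0.3, mu1=-0.2, mu2=0.0)
# Item that depends on a second nuisance dimension on which Group 1 is higher.
per_item_reversed = item_bias(theta, 0.3, mu1=0.2, mu2=0.0)

# No cancellation: all items depend on the first nuisance dimension.
no_cancellation = [per_item] * n_items
# Cancellation: half the items depend on each nuisance dimension.
with_cancellation = [per_item] * (n_items // 2) + [per_item_reversed] * (n_items // 2)

if __name__ == "__main__":
    print("bias of a single item:", per_item)
    print("test-level bias, no cancellation:", sum(no_cancellation))
    print("test-level bias, cancellation:", sum(with_cancellation))
```

In this sketch each item differs between groups by under two percentage points, yet the differences add up to a noticeable fraction of a score point when they all point the same way, and they offset one another when the nuisance dimensions favor different groups.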

Section 5 presents a long-test viewpoint of test bias, making heavy use of Stout’s theory of essential
unidimensionality. The condition that no long-test test bias holds is defined. It is shown that if an
equicontinuous balanced test score (a large class of reasonable test scores are of this kind) displays
appropriate convergence-in-probability behavior separately in each examinee group, then there can
be no long-test test bias. Essential unidimensionality ($d_E = 1$) of a test with target ability as the
latent trait is shown to exclude long-test test bias. Because one can statistically test for essential
unidimensionality (Stout, 1987), this is a potentially very useful result. Theorem 5.3 is important
as the long-test analogue to Theorem 4.2. It links potential for bias and scoring method to the
existence of long-test test bias.

A long-test viewpoint of subtest validity is also presented in Section 5. Informally stated, the
main result is that $d_E = 1$ for a subtest with the latent trait being target ability implies subtest
validity for all equicontinuous balanced scoring methods.

Section 6 considers test bias aggregated over target ability. The important concepts of unidirectional
and bidirectional test bias are introduced. The relationship between differing discriminations
across groups and bidirectional test bias is explicated.

It  is  hoped  that  the  above  theory  of  test  validity  proves  useful  to  theoreticians  and  practitioners 
alike. 

Acknowledgement.  The  authors  found  discussions  with  Terry  Ackerman,  Paul  Holland,  Lloyd 